One must learn by doing the thing;
for though you think you know it 
You have no certainty, until you try. 
—Sophocles, Trachiniae 


PREFACE 





The principal goal of this book is to provide a unified introduction to the theory, implementation, and applications 
of statistical and adaptive signal processing methods. We have focused on the key topics of spectral estimation,
signal modeling, and adaptive filtering, which were selected for their theoretical value and practical
importance. The book has been written primarily with students and instructors in mind. The principal objectives
are to provide an introduction to basic concepts and methodologies that can provide the foundation for further 
study, research, and application to new problems. To achieve these goals, we have focused on topics that we 
consider fundamental and have either multiple or important applications. 


Approach and prerequisites 


The adopted approach is intended to help both students and practicing engineers understand the fundamental 
mathematical principles underlying the operation of a method, appreciate its inherent limitations, and provide 
sufficient details for its practical implementation. The academic flavor of this book has been influenced by our 
teaching whereas its practical character has been shaped by our research and development activities in both 
academia and industry. The mathematical treatment throughout this book has been kept at a level that is within the 
grasp of upper-level undergraduate students, graduate students, and practicing electrical engineers with a 
background in digital signal processing, probability theory, and linear algebra. 


Organization of the book 


Chapter 1 introduces the basic concepts and applications of statistical and adaptive signal processing and provides 
an overview of the book. Chapter 2 introduces some basic concepts of estimation theory. Chapter 3 provides a
treatment of parametric linear signal models (both deterministic and stochastic) in the time and frequency domains. 
Chapter 4 presents the most practical methods for the estimation of correlation and spectral densities. Chapter 5 
provides a detailed study of the theoretical properties of optimum filters, assuming that the relevant signals can be 
modeled as stochastic processes with known statistical properties; and Chapter 6 contains algorithms and 
structures for optimum filtering, signal modeling, and prediction. Chapter 7 introduces the principle of 
least-squares estimation and its application to the design of practical filters and predictors. Chapters 8 and 9 use 
the theoretical work in Chapters 3, 5 and 6 and the practical methods in Chapter 7 to develop, evaluate, and apply 
practical techniques for signal modeling and adaptive filtering.


Theory and practice 


It is our belief that sound theoretical understanding goes hand-in-hand with practical implementation and 
application to real-world problems. Therefore, the book includes a large number of computer experiments that 
illustrate important concepts and help the reader to easily implement the various methods. Every chapter includes 
examples, problems, and computer experiments that facilitate the comprehension of the material. To help the 
reader understand the theoretical basis and limitations of the various methods and apply them to real-world 
problems, we provide MATLAB functions for all major algorithms and examples illustrating their use. 


Feedback 


Although we are fully aware that there always exists room for improvement, we believe that this book is a big step 
forward for an introductory textbook in statistical and adaptive signal processing. However, as engineers, we 
know that every search for the optimum requires the will to change and quest for additional improvement. Thus, 
we would appreciate feedback from teachers, students, and engineers using this book for self-study at 
vingle@lynx.neu.edu.


CHAPTER 1


Introduction 


This book is an introduction to the theory and algorithms used for the analysis and processing of random signals and 
their applications to real-world problems. The fundamental characteristic of random signals is captured in the 
following statement: Although random signals are evolving in time in an unpredictable manner, their average 
statistical properties exhibit considerable regularity. This provides the ground for the description of random signals 
using statistical averages instead of explicit equations. When we deal with random signals, the main objectives are the 
statistical description, modeling, and exploitation of the dependence between the values of one or more discrete-time 
signals and their application to theoretical and practical problems. 

Random signals are described mathematically by using the theory of probability, random variables, and 
stochastic processes. However, in practice we deal with random signals by using statistical techniques. Within this 
framework we can develop, at least in principle, theoretically optimum signal processing methods that can inspire the 
development and can serve to evaluate the performance of practical statistical signal processing techniques. The area 
of adaptive signal processing involves the use of optimum and statistical signal processing techniques to design 
signal processing systems that can modify their characteristics, during normal operation (usually in real time), to 
achieve a clearly predefined application-dependent objective. 

The purpose of this chapter is twofold: to illustrate the nature of random signals with some typical examples and 
to introduce the three major application areas treated in this book: spectral estimation, signal modeling and adaptive 
filtering. Throughout the book, the emphasis is on the application of techniques to actual problems in which the 
theoretical framework provides a foundation to motivate the selection of a specific method. 


1.1 Random Signals 


A discrete-time signal or time series is a set of observations taken sequentially in time, space, or some other 
independent variable. Examples occur in various areas, including engineering, natural sciences, economics, social 
sciences, and medicine. 

A discrete-time signal x(n) is basically a sequence of real or complex numbers called samples. Although the 
integer index n may represent any physical variable (e.g., time, distance), we shall generally refer to it as time. 
Furthermore, in this book we consider only time series with observations occurring at equally spaced intervals of time. 

Discrete-time signals can arise in several ways. Very often, a discrete-time signal is obtained by periodically 
sampling a continuous-time signal, that is, $x(n) = x_c(nT)$, where $T = 1/F_s$ (seconds) is the sampling period and
$F_s$ (samples per second or hertz) is the sampling frequency. At other times, the samples of a discrete-time signal are
obtained by accumulating some quantity (which does not have an instantaneous value) over equal intervals of time, 
for example, the number of cars per day traveling on a certain road. Finally, some signals are inherently discrete-time, 
for example, daily stock market prices. Throughout the book, except if otherwise stated, the terms signal, time series, 
or sequence will be used to refer to a discrete-time signal. 
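
As a minimal sketch of periodic sampling, the following Python fragment evaluates a continuous-time signal at the instants t = nT. The function name `sample`, the 50-Hz cosine, and the 200-Hz sampling frequency are illustrative assumptions and are not taken from the text.

```python
import math

def sample(x_c, fs, num_samples):
    """Return x(n) = x_c(n*T) for n = 0, ..., num_samples-1, with T = 1/fs."""
    T = 1.0 / fs  # sampling period in seconds
    return [x_c(n * T) for n in range(num_samples)]

# Hypothetical continuous-time signal: a 50-Hz cosine.
def x_c(t):
    return math.cos(2 * math.pi * 50.0 * t)

# Sampling at 200 Hz gives four samples per period of the cosine.
x = sample(x_c, fs=200.0, num_samples=8)
print([round(v, 6) for v in x])
```

With four samples per period, the sequence should cycle through values close to 1, 0, -1, 0, up to floating-point rounding.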

The key characteristics of a time series are that the observations are ordered in time and that adjacent 
observations are dependent (related). To see graphically the relation between the samples of a signal that are $l$
sampling intervals away, we plot the points $\{x(n), x(n+l)\}$ for $0 \le n \le N-1-l$, where $N$ is the length of the
data record. The resulting graph is known as the lag-$l$ scatter plot. This is illustrated in Figure 1.1, which shows a
speech signal and two scatter plots that demonstrate the correlation between successive samples. We note that for 
adjacent samples the data points fall close to a straight line with a positive slope. This implies high correlation 
because every sample is followed by a sample with about the same amplitude. In contrast, samples that are 20 
sampling intervals apart are much less correlated because the points in the scatter plot are randomly spread. 
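
The construction of a scatter plot at a given lag can be sketched in a few lines of Python. The sketch below forms the pairs of samples and summarizes each plot by its sample correlation coefficient. The test signal (a slow sinusoid plus noise) is a hypothetical stand-in for the speech data of Figure 1.1, and the function names are our own.

```python
import math
import random

def lag_pairs(x, lag):
    """Return the points (x(n), x(n+lag)) for 0 <= n <= N-1-lag."""
    return [(x[n], x[n + lag]) for n in range(len(x) - lag)]

def correlation_coefficient(pairs):
    """Sample correlation coefficient between the two coordinates."""
    m = len(pairs)
    mean_u = sum(u for u, _ in pairs) / m
    mean_v = sum(v for _, v in pairs) / m
    cov = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    var_u = sum((u - mean_u) ** 2 for u, _ in pairs)
    var_v = sum((v - mean_v) ** 2 for _, v in pairs)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical test signal: sinusoid with period 80 samples plus weak noise.
rng = random.Random(0)
x = [math.sin(2 * math.pi * n / 80) + 0.1 * rng.gauss(0.0, 1.0) for n in range(400)]

r1 = correlation_coefficient(lag_pairs(x, 1))    # adjacent samples
r20 = correlation_coefficient(lag_pairs(x, 20))  # a quarter period apart
print(f"lag 1: {r1:.3f}   lag 20: {r20:.3f}")
```

For this particular signal, the lag-1 points should cluster around a line of positive slope (coefficient near 1), whereas at lag 20 the coefficient should be close to zero.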

When successive observations of the series are dependent, we may use past observations to predict future values. 
If the prediction is exact, the series is said to be deterministic. However, in most practical situations we cannot predict 
a time series exactly. Such time series are called random or stochastic, and the degree of their predictability is
determined by the dependence between consecutive observations. The ultimate case of randomness occurs when
every sample of a random signal is independent of all other samples. Such a signal, which is completely unpredictable, 
is known as white noise and is used as a building block to simulate random signals with different types of dependence. 
To summarize, the fundamental characteristic of a random signal is the inability to precisely specify its values. In 
other words, a random signal is not predictable, it never repeats itself, and we cannot find a mathematical formula that 
provides its values as a function of time. As a result, random signals can only be mathematically described by using 
the theory of stochastic processes.

FIGURE 1.1
(a) The waveform for the speech signal “signal”; (b) two scatter plots for successive samples and samples separated by 20 sampling
intervals.

This book provides an introduction to the fundamental theory and a broad selection of algorithms widely used
for the processing of discrete-time random signals. Signal processing techniques, dependent on their main objective, 
can be classified as follows (see Figure 1.2):

• Signal analysis. The primary goal is to extract useful information that can be used to understand the signal
generation process or extract features that can be used for signal classification purposes. Most of the methods in this
area are treated under the disciplines of spectral estimation and signal modeling. Typical applications include
detection and classification of radar and sonar targets, speech and speaker recognition, detection and classification of
natural and artificial seismic events, event detection and classification in biological and financial signals, efficient
signal representation for data compression, etc.

• Signal filtering. The main objective of signal filtering is to improve the quality of a signal according to an
acceptable criterion of performance. Signal filtering can be subdivided into the areas of frequency selective
filtering and adaptive filtering. Typical applications include noise and interference cancelation, echo cancelation,
channel equalization, seismic deconvolution, active noise control, etc.

FIGURE 1.2
Classification of methods for the analysis and processing of random signals.
[Diagram: random signals; theory of stochastic processes, estimation, and optimum filtering (Chapters 2, 5, 6); analysis, comprising spectral estimation (Chapters 4, 8) and signal modeling (Chapter 3); filtering, comprising adaptive filtering (Chapters 7, 9) and array processing.]


We conclude this section with some examples of signals occurring in practical applications. Although the description of 
these signals is far from complete, we provide sufficient information to illustrate their random nature and significance in 
signal processing applications. 

Speech signals. Figure 1.3 shows the spectrogram and speech waveform corresponding to the utterance “signal.” 
The spectrogram is a visual representation of the distribution of the signal energy as a function of time and frequency. 
We note that the speech signal has significant changes in both amplitude level and spectral content across time. The 
waveform contains segments of voiced (quasi-periodic) sounds, such as “e,” and unvoiced or fricative (noiselike)
sounds, such as “g.”

FIGURE 1.3
Spectrogram and acoustic waveform for the utterance “signal.” The horizontal dark bands show the resonances of the vocal tract, which 
change as a function of time depending on the sound or phoneme being produced. 

Speech production involves three processes: generation of the sound excitation, articulation by the vocal tract, 
and radiation from the lips and/or nostrils. If the excitation is a quasi-periodic train of air pressure pulses, produced by 
the vibration of the vocal cords, the result is a voiced sound. Unvoiced sounds are produced by first creating a 
constriction in the vocal tract, usually toward the mouth end. Then we generate turbulence by forcing air through the 
constriction at a sufficiently high velocity. The resulting excitation is a broadband noiselike waveform. 

The spectrum of the excitation is shaped by the vocal tract tube, which has a frequency response that resembles 
the resonances of organ pipes or wind instruments. The resonant frequencies of the vocal tract tube are known as 
formant frequencies, or simply formants. Changing the shape of the vocal tract changes its frequency response and 
results in the generation of different sounds. Since the shape of the vocal tract changes slowly during continuous 
speech, we usually assume that it remains almost constant over intervals on the order of 10 ms. More details about 
speech signal generation and processing can be found in Rabiner and Schafer 1978; O’Shaughnessy 1987; and 
Rabiner and Juang 1993. 


Other examples of stochastic signals include electrophysiological signals, geophysical signals, and radar signals.


1.2 Spectral Estimation


The central objective of signal analysis is the development of quantitative techniques to study the properties of a
signal and the differences and similarities between two or more signals from the same or different sources. The major
areas of random signal analysis are (1) statistical analysis of signal amplitude (i.e., the sample values); (2) analysis 
and modeling of the correlation among the samples of an individual signal; and (3) joint signal analysis (i.e., 
simultaneous analysis of two signals in order to investigate their interaction or interrelationships). These techniques 
are summarized in Figure 1.4. The prominent tool in signal analysis is spectral estimation, which is a generic term for 
a multitude of techniques used to estimate the distribution of energy or power of a signal from a set of observations. 
Spectral estimation is a very complicated process that requires a deep understanding of the underlying theory and a 
great deal of practical experience. Spectral analysis finds many applications in areas such as medical diagnosis, 
speech analysis, seismology and geophysics, radar and sonar, nondestructive fault detection, testing of physical 
theories, and evaluating the predictability of time series.

FIGURE 1.4
Summary of random signal analysis techniques.
[Diagram not reproduced; surviving label: higher-order statistics.]


Amplitude distribution. The range of values taken by the samples of a signal and how often the signal assumes 
these values together determine the signal variability. The signal variability can be seen by plotting the time series 
and is quantified by the histogram of the signal samples, which shows the percentage of the signal amplitude values 
within a certain range. The numerical description of signal variability, which depends only on the value of the signal 
samples and not on their ordering, involves quantities such as mean value, median, variance, and dynamic range. 
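
The amplitude description above can be illustrated with a short sketch. The function names are our own, the dynamic range is taken here as the difference between the largest and smallest sample (an assumption, since the text does not define it), and the five-sample signal is purely illustrative.

```python
import statistics

def amplitude_summary(x):
    """Order-independent numerical description of signal variability."""
    return {
        "mean": statistics.fmean(x),
        "median": statistics.median(x),
        "variance": statistics.pvariance(x),
        "dynamic_range": max(x) - min(x),  # assumed definition: max minus min
    }

def histogram(x, num_bins):
    """Percentage of samples in each of num_bins equal-width bins.

    Assumes the signal is not constant, so the bin width is nonzero.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in x:
        k = min(int((v - lo) / width), num_bins - 1)  # largest value goes in the last bin
        counts[k] += 1
    return [100.0 * c / len(x) for c in counts]

x = [0.0, 1.0, 1.0, 2.0, 4.0]
reordered = [4.0, 1.0, 0.0, 2.0, 1.0]  # same values, different ordering

print(amplitude_summary(x))
print(histogram(x, 4))
print(histogram(x, 4) == histogram(reordered, 4))
```

Because these quantities depend only on the sample values, the original and the reordered sequence yield the same summary and the same histogram.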

Figure 1.5 shows the one-step increments, that is, the first difference $x_i(n) = x(n) - x(n-1)$, of two infrared signals, whereas Figure
1.6 shows their histograms. Careful examination of the shape of the histogram curves indicates that the second signal
jumps quite frequently between consecutive samples with large steps. In other words, the probability of large 
increments is significant, as exemplified by the fat tails of the histogram in Figure 1.6(b). The knowledge of the 
probability of extreme values is essential in the design of detection systems for digital communications, military 
surveillance using infrared and radar sensors, and intensive care monitoring. In general, the shape of the histogram, or 
more precisely the probability density, is very important in applications such as signal coding and event detection. 
Although many practical signals follow a Gaussian distribution, many other signals of practical interest have 
distributions that are non-Gaussian. For example, speech signals have a probability density that can be reasonably 
approximated by a gamma distribution (Rabiner and Schafer 1978). 

The significance of the Gaussian distribution in signal processing stems from the following facts. First, many 
physical signals can be described by Gaussian processes. Second, the central limit theorem states that any process that 
is the result of the combination of many elementary processes will tend, under quite general conditions, to be 
Gaussian. Finally, linear systems preserve the Gaussianity of their input signals. To understand the last two 
statements, consider $N$ independent random quantities $x_1, x_2, \ldots, x_N$ with the same probability density $p(x)$ and pose
the following question: When does the probability distribution $p_N(x)$ of their sum $x = x_1 + x_2 + \cdots + x_N$ have the
same shape (within a scale factor) as the distribution $p(x)$ of the individual quantities? The standard answer is that
$p(x)$ should be Gaussian, because the sum of $N$ Gaussian random variables is again a Gaussian, but with variance
equal to $N$ times that of the individual signals. However, if we allow for distributions with infinite variance, additional
solutions are possible. The resulting probability distributions, known as stable or Lévy distributions, have infinite
variance and are characterized by a thin main lobe and fat tails, resembling the shape of the histogram in 
Figure 1.6(b). Interestingly enough, the Gaussian distribution is a stable distribution with finite variance (actually the 
only one). Because Gaussian and stable non-Gaussian distributions are invariant under linear signal processing 
operations, they are very important in signal processing.
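
The shape-preservation property can be verified with a short calculation based on characteristic functions. This is only a sketch for zero-mean, symmetric densities; the symbols $\Phi(\omega)$, $\sigma$, $c$, and $\alpha$ are our notation and do not appear in the text.

```latex
\begin{align*}
% Gaussian case: characteristic function of each x_k
\Phi(\omega) &= E\{e^{j\omega x_k}\} = e^{-\sigma^2\omega^2/2} \\
% Independence turns the sum into a product of characteristic functions
\Phi_N(\omega) &= [\Phi(\omega)]^N = e^{-N\sigma^2\omega^2/2}
  && \text{(Gaussian, variance } N\sigma^2\text{, scale factor } \sqrt{N}\text{)} \\
% Symmetric stable case with index 0 < \alpha \le 2
\Phi(\omega) &= e^{-|c\,\omega|^{\alpha}} \\
\Phi_N(\omega) &= e^{-N|c\,\omega|^{\alpha}} = e^{-|N^{1/\alpha} c\,\omega|^{\alpha}}
  && \text{(same shape, scale factor } N^{1/\alpha}\text{)}
\end{align*}
```

Setting $\alpha = 2$ recovers the Gaussian case, which is the only member of this family with finite variance.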

FIGURE 1.5
One-step-increment time series for the infrared data.

FIGURE 1.6
Histograms for the infrared increment signals: (a) Infrared 1; (b) Infrared 2.


Correlation and spectral analysis. Although scatter plots (see Figure 1.1) illustrate nicely the existence of 
correlation, to obtain quantitative information about the correlation structure of a time series x(n) with zero mean 
value, we use the empirical normalized autocorrelation sequence 


$$\hat{\rho}(l) = \frac{\sum_{n=l}^{N-1} x(n)\, x^{*}(n-l)}{\sum_{n=0}^{N-1} |x(n)|^{2}} \qquad (1.2.1)$$


which is an estimate of the theoretical normalized autocorrelation sequence. For lag $l = 0$, the sequence is perfectly
correlated with itself and we get the maximum value of 1. If the sequence does not change significantly from sample 
to sample, the correlation of the sequence with its shifted copies, though diminished, is still close to 1. Usually, the 
correlation decreases as the lag increases because distant samples become less and less dependent. Note that 
reordering the samples of a time series changes its autocorrelation but not its histogram. 
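
A direct implementation of the empirical normalized autocorrelation (1.2.1) is sketched below. The function name and the two eight-sample test sequences are illustrative assumptions; the sequences contain the same values in a different order.

```python
def normalized_autocorrelation(x, max_lag):
    """Empirical normalized autocorrelation (1.2.1) for lags 0..max_lag.

    x is a zero-mean real or complex sequence of length N > max_lag.
    """
    energy = sum(abs(v) ** 2 for v in x)  # denominator of (1.2.1)
    rho = []
    for lag in range(max_lag + 1):
        num = sum(x[n] * x[n - lag].conjugate() for n in range(lag, len(x)))
        rho.append(num / energy)
    return rho

# Two zero-mean sequences with identical histograms but different ordering.
slow = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
fast = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(normalized_autocorrelation(slow, 2))
print(normalized_autocorrelation(fast, 2))
```

Both sequences give the value 1 at lag 0, but their lag-1 values differ in sign and magnitude, which illustrates that the autocorrelation depends on the ordering of the samples whereas the histogram does not.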

We say that signals whose empirical autocorrelation decays fast, such as an exponential, have short-memory or 
short-range dependence. If the empirical autocorrelation decays very slowly, as a hyperbolic function does, we say
that the signal has long-memory or long-range dependence. Furthermore, we shall see in the next section that
effective modeling of time series with short or long memory requires different types of models. 

The spectral density function shows the distribution of signal power or energy as a function of frequency (see 
Figure 1.7). The autocorrelation and the spectral density of a signal form a Fourier transform pair and hence contain 
the same information. However, they present this information in different forms, and one can reveal information that 
cannot be easily extracted from the other. It is fair to say that the spectral density is more widely used than the 
autocorrelation. 
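
The Fourier relation between autocorrelation and spectral density can be checked numerically for a finite data record. The sketch below assumes the biased autocorrelation estimate $r(l) = (1/N)\sum_n x(n)x^{*}(n-l)$ and the squared-magnitude spectrum $|X(e^{j\omega})|^2/N$, commonly called the periodogram; neither estimator has been introduced in the text at this point, so both are our choices for illustration.

```python
import cmath
import math

def autocorrelation(x, lag):
    """Biased estimate r(l) = (1/N) sum_n x(n) x*(n-l), with r(-l) = r*(l)."""
    N = len(x)
    l = abs(lag)
    r = sum(x[n] * x[n - l].conjugate() for n in range(l, N)) / N
    return r if lag >= 0 else r.conjugate()

def spectrum_from_autocorrelation(x, omega):
    """Fourier transform of r(l) over all lags -(N-1), ..., N-1."""
    N = len(x)
    total = sum(autocorrelation(x, l) * cmath.exp(-1j * omega * l)
                for l in range(-(N - 1), N))
    return total.real  # imaginary part vanishes up to rounding

def periodogram(x, omega):
    """Squared magnitude of the Fourier transform of the record, divided by N."""
    X = sum(x[n] * cmath.exp(-1j * omega * n) for n in range(len(x)))
    return abs(X) ** 2 / len(x)

x = [1.0, 2.0, 0.0, -1.0, -2.0]  # hypothetical short data record
for omega in (0.0, 0.7, math.pi / 2, 3.0):
    print(omega, periodogram(x, omega), spectrum_from_autocorrelation(x, omega))
```

The two columns should agree to within rounding error at every frequency, which is the finite-record counterpart of the statement that the two descriptions carry the same information.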

FIGURE 1.7
Illustration of the concept of power or energy spectral density function of a random signal.
[Panels: signal plotted against time; spectral density (power or energy) plotted against frequency.]


Joint signal analysis. In many applications, we are interested in the relationship between two different random 
signals. There are two cases of interest. In the first case, the two signals are of the same or similar nature, and we want 
to ascertain and describe the similarity or interaction between them. 

In the second case, we may have reason to believe that there is a causal relationship between the two signals. 
For example, one signal may be the input to a system and the other signal the output. The task in this case is to find an 
accurate description of the system, that is, a description that allows accurate estimation of future values of the output 
from the input. This process is known as system modeling or identification and has many practical applications, 
including understanding the operation of a system in order to improve the design of new systems or to achieve better 
control of existing systems. 


1.3 Signal Modeling 


In many theoretical and practical applications, we are interested in generating random signals with certain properties 
or obtaining an efficient representation of real-world random signals that captures a desired set of their characteristics 
(e.g., correlation or spectral features) in the best possible way. We use the term model to refer to a mathematical 
description that provides an efficient representation of the “essential” properties of a signal. 

For example, a finite segment $\{x(n)\}_{n=0}^{N-1}$ of any signal can be approximated by a linear combination of
constant ($\lambda_k = 1$) or exponentially fading ($0 < \lambda_k < 1$) sinusoids

$$x(n) = \sum_{k=1}^{M} a_k \lambda_k^{n} \cos(\omega_k n + \phi_k) \qquad (1.3.1)$$

where $\{a_k, \lambda_k, \omega_k, \phi_k\}$ are the model parameters. A good model should provide an accurate description of the
signal with $4M \ll N$ parameters. From a practical viewpoint, we are most interested in parametric models, which
assume a given functional form completely specified by a finite number of parameters. In contrast, nonparametric 
models do not put any restriction on the functional form or the number of model parameters. 
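
To make the parametric model (1.3.1) concrete, the following sketch synthesizes a signal segment from a list of parameter tuples. The function name and the two sets of parameter values are hypothetical and chosen only for illustration.

```python
import math

def damped_sinusoid_model(params, num_samples):
    """Evaluate (1.3.1): x(n) = sum_k a_k * lam_k**n * cos(omega_k*n + phi_k).

    params is a list of (a_k, lam_k, omega_k, phi_k) tuples.
    """
    return [
        sum(a * lam ** n * math.cos(omega * n + phi)
            for a, lam, omega, phi in params)
        for n in range(num_samples)
    ]

# Hypothetical parameters: one constant and one exponentially fading sinusoid.
params = [
    (1.0, 1.0, math.pi / 4, 0.0),  # lam = 1: constant amplitude
    (0.5, 0.9, math.pi / 2, 0.0),  # 0 < lam < 1: fading amplitude
]
N = 100
x = damped_sinusoid_model(params, N)
print(f"{4 * len(params)} parameters describe {N} samples")
```

Here M = 2, so eight parameters describe a segment of 100 samples, in keeping with the requirement that 4M be much smaller than N.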

If any of the model parameters in (1.3.1) is random, the result is a random signal. The most widely used model is 
given by 


$$x(n) = \sum_{k=1}^{M} a_k \cos(\omega_k n + \phi_k)$$

where the amplitudes $\{a_k\}_1^M$ and the frequencies $\{\omega_k\}_1^M$ are constants and the phases $\{\phi_k\}_1^M$ are random. This
model is known as the harmonic process model and has many theoretical and practical applications.

Suppose next that we are given a sequence $w(n)$ of independent and identically distributed observations. We
can create a time series $x(n)$ with dependent observations, by linearly combining the values of $w(n)$ as

$$x(n) = \sum_{k=-\infty}^{\infty} h(k)\, w(n-k) \qquad (1.3.2)$$

which results in the widely used linear random signal model. The model specified by the convolution summation
(1.3.2) is clearly nonparametric because, in general, it depends on an infinite number of parameters. Furthermore, 
the model is a linear, time-invariant system with impulse response h(k) that determines the memory of the model 
and, therefore, the dependence properties of the output x(n). By properly choosing the weights h(k), we can 
generate a time series with almost any type of dependence among its samples. 
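
The effect of the weights on the dependence of the output can be illustrated with a truncated version of (1.3.2). The sketch below assumes a causal impulse response of finite length and Gaussian white noise as the excitation; the weights, the seed, and the function names are hypothetical.

```python
import random

def linear_model(h, w):
    """x(n) = sum_k h(k) w(n-k) for a causal, finite-length h; w(n) = 0 for n < 0."""
    return [
        sum(h[k] * w[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(w))
    ]

def lag_correlation(x, lag):
    """Normalized correlation between samples that are lag intervals apart."""
    energy = sum(v * v for v in x)
    return sum(x[n] * x[n - lag] for n in range(lag, len(x))) / energy

rng = random.Random(1)
w = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # independent samples
h = [1.0, 0.8, 0.6, 0.4, 0.2]                   # hypothetical weights
x = linear_model(h, w)

for lag in (1, 5):
    print(lag, round(lag_correlation(w, lag), 3), round(lag_correlation(x, lag), 3))
```

For these weights the output correlation at lag 1 should be near 1.6/2.2, whereas at lag 5, beyond the length of the impulse response, it should be close to zero, as it is for the input at every nonzero lag.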

In practical applications, we are interested in linear parametric models. As we will see, parametric models 
exhibit a dependence imposed by their structure. However, if the number of parameters approaches the range of the 
dependence (in number of samples), the model can mimic any form of dependence. The list of desired features for a 
good model includes these: (1) the number of model parameters should be as small as possible (parsimony), (2) 
estimation of the model parameters from the data should be easy, and (3) the model parameters should have a 
physically meaningful interpretation. 

If we can develop a successful parametric model for the behavior of a signal, then we can use the model for 
various applications: 

1. To achieve a better understanding of the physical mechanism generating the signal (e.g., earth structure in the case 
of seismograms). 

2. To track changes in the source of the signal and help identify their cause (e.g., EEG). 

3. To synthesize artificial signals similar to the natural ones (e.g., speech, infrared backgrounds, natural scenes, data 
network traffic). 

4. To extract parameters for pattern recognition applications (e.g., speech and character recognition). 

5. To get an efficient representation of signals for data compression (e.g., speech, audio, and video coding). 

6. To forecast future signal behavior (e.g., stock market indexes) (Pindyck and Rubinfeld 1998). 


In practice, signal modeling involves the following steps: (1) selection of an appropriate model, (2) selection of the 
“right” number of parameters, (3) fitting of the model to the actual data, and (4) model testing to see if the model 
satisfies the user requirements for the particular application. As we shall see in Chapter 8, this process is very
complicated and depends heavily on the understanding of the theoretical model properties (see Chapter 3), the amount of 
familiarity with the particular application, and the experience of the user. 


Rational or Pole-Zero Models 


Suppose that a given sample $x(n)$, at time $n$, can be approximated by the previous sample weighted by a
coefficient $a$, that is, $x(n) \approx a x(n-1)$, where $a$ is assumed constant over the signal segment to be modeled. To make
the above relationship exact, we add an excitation term $w(n)$, resulting in

$$x(n) = a x(n-1) + w(n) \qquad (1.3.3)$$

where $w(n)$ is an excitation sequence. Taking the $z$-transform of both sides, we have

$$X(z) = a z^{-1} X(z) + W(z) \qquad (1.3.4)$$

which results in the following system function:

$$H(z) = \frac{X(z)}{W(z)} = \frac{1}{1 - a z^{-1}} \qquad (1.3.5)$$

By using the identity

$$\frac{1}{1 - a z^{-1}} = 1 + a z^{-1} + a^{2} z^{-2} + \cdots \qquad -1 < a < 1 \qquad (1.3.6)$$

the single-parameter model in (1.3.3) can be expressed in the following nonparametric form

$$x(n) = w(n) + a w(n-1) + a^{2} w(n-2) + \cdots \qquad (1.3.7)$$


which clearly indicates that the model generates a time series with exponentially decaying dependence. 
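
The equivalence between the recursion (1.3.3) and the expansion (1.3.7) can be verified numerically. The sketch below assumes zero initial conditions and an excitation that is zero before n = 0, so that the infinite sum in (1.3.7) stops at k = n; the function names are our own.

```python
import random

def ar1_recursive(a, w):
    """Recursion (1.3.3): x(n) = a*x(n-1) + w(n), starting from x(-1) = 0."""
    x, prev = [], 0.0
    for wn in w:
        prev = a * prev + wn
        x.append(prev)
    return x

def ar1_expansion(a, w):
    """Expansion (1.3.7): x(n) = sum_{k=0}^{n} a**k * w(n-k), with w(n) = 0 for n < 0."""
    return [sum(a ** k * w[n - k] for k in range(n + 1)) for n in range(len(w))]

# A unit impulse as excitation exposes the exponentially decaying weights a**k.
impulse = [1.0, 0.0, 0.0, 0.0]
print(ar1_recursive(0.5, impulse))
print(ar1_expansion(0.5, impulse))

# The two forms also agree for a random excitation.
rng = random.Random(2)
w = [rng.gauss(0.0, 1.0) for _ in range(50)]
gap = max(abs(u - v) for u, v in zip(ar1_recursive(0.9, w), ar1_expansion(0.9, w)))
print(gap)
```

The recursion needs a single coefficient and one stored sample, whereas the expansion uses a number of weights that grows with n. This is the practical difference between the parametric and nonparametric forms of the same model.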
A more general model can be obtained by including a linear combination of the P previous values of the signal 
and of the Q previous values of the excitation in (1.3.3), that is, 


$$x(n) = \sum_{k=1}^{P} (-a_k)\, x(n-k) + \sum_{k=0}^{Q} d_k\, w(n-k) \qquad (1.3.8)$$

The resulting system function

$$H(z) = \frac{X(z)}{W(z)} = \frac{\sum_{k=0}^{Q} d_k z^{-k}}{1 + \sum_{k=1}^{P} a_k z^{-k}} \qquad (1.3.9)$$


is rational, that is, a ratio of two polynomials in the variable $z^{-1}$, hence the term rational models. We will show in
Chapter 3 that any rational model has a dependence structure or memory that decays exponentially with time. 
Because the roots of the numerator polynomial are known as zeros and the roots of the denominator polynomial as 
poles, these models are also known as pole-zero models. In the time-series analysis literature, these models are known 
as autoregressive moving-average (ARMA) models. 
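
The difference equation (1.3.8) translates directly into a short routine. The sketch below assumes zero initial conditions and an excitation that is zero before n = 0; the function name and the coefficient values in the examples are hypothetical.

```python
def pole_zero_filter(a, d, w):
    """Difference equation (1.3.8) with zero initial conditions.

    a = [a_1, ..., a_P] are the denominator (pole) coefficients,
    d = [d_0, ..., d_Q] are the numerator (zero) coefficients.
    """
    x = []
    for n in range(len(w)):
        # Moving-average part: sum_{k=0}^{Q} d_k w(n-k)
        acc = sum(d[k] * w[n - k] for k in range(len(d)) if n - k >= 0)
        # Autoregressive part: sum_{k=1}^{P} (-a_k) x(n-k)
        acc -= sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        x.append(acc)
    return x

impulse = [1.0, 0.0, 0.0, 0.0]

# P = 1, Q = 0 with a_1 = -0.5 reproduces the single-parameter model (1.3.3) with a = 0.5.
print(pole_zero_filter([-0.5], [1.0], impulse))

# P = 0 gives an all-zero model whose impulse response is just the d_k.
print(pole_zero_filter([], [1.0, 2.0], impulse))
```

Note the sign convention: because (1.3.8) uses $-a_k$, the coefficient $a$ of (1.3.3) corresponds to $a_1 = -a$. The routine does not check stability, so the caller is assumed to supply coefficients whose poles lie inside the unit circle.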

Modeling the vocal tract. An example of the application of the pole-zero model is for the characterization of the 
speech production system. Most generally, speech sounds are classified as either voiced or unvoiced. For both of 
these types of speech, the production is modeled by exciting a linear system, the vocal tract, with an excitation having 
a flat, that is, constant, spectrum. The vocal tract, in turn, is modeled by using a pole-zero system, with the poles 
modeling the vocal tract resonances and the zeros serving the purpose of dampening the spectral response between 
pole frequencies. In the case of voiced speech, the input to the vocal tract model is a quasi-periodic pulse waveform, 
whereas for unvoiced speech the source is modeled as random noise. The system model of the speech production 
process is shown in Figure 1.8. The parameters of this model are the voiced/unvoiced classification, the pitch period 
for voiced sounds, the gain parameter, and the coefficients $\{d_k\}$ and $\{a_k\}$ of the vocal tract filter (1.3.9). This
model is widely used for low-bit-rate (less than 2.4 kbits/s) speech coding, synthetic speech generation, and extraction
of features for speech and speaker recognition (Rabiner and Schafer 1978; Rabiner and Juang 1993; Furui 1989).
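
The speech production model described above can be sketched as an excitation generator followed by a pole-zero filter. Everything specific in the sketch is an assumption made for illustration: the function names, the unit-pulse train used for voiced excitation, Gaussian noise for unvoiced excitation, and the filter coefficients, which are not those of a real vocal tract.

```python
import random

def excitation(voiced, num_samples, pitch_period, seed=0):
    """Pulse train (voiced) or white Gaussian noise (unvoiced)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def synthesize(voiced, gain, a, d, num_samples, pitch_period=80, seed=0):
    """Scale the excitation by the gain and pass it through the filter (1.3.9)."""
    w = [gain * v for v in excitation(voiced, num_samples, pitch_period, seed)]
    x = []
    for n in range(num_samples):
        acc = sum(d[k] * w[n - k] for k in range(len(d)) if n - k >= 0)
        acc -= sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        x.append(acc)
    return x

# Hypothetical filter with one resonance (pole radius 0.9) and no zeros.
a = [-1.2, 0.81]
d = [1.0]
voiced_frame = synthesize(True, gain=1.0, a=a, d=d, num_samples=160, pitch_period=80)
unvoiced_frame = synthesize(False, gain=0.1, a=a, d=d, num_samples=160)
print(len(voiced_frame), len(unvoiced_frame))
```

The model parameters listed in the text appear as the arguments of `synthesize`: the voiced/unvoiced flag, the pitch period, the gain, and the two coefficient sets.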

FIGURE 1.8
Speech synthesis system based on pole-zero modeling.

A classification of the various signal models described previously is given in Figure 1.9, which also provides 
information about the chapters of the book where these signals are discussed.

FIGURE 1.9
Classification of random signal models.


1.4 Adaptive Filtering


Conventional frequency-selective digital filters with fixed coefficients are designed to have a given frequency 
response chosen to alter the spectrum of the input signal in a desired manner. Their key features are as follows: 


1. The filters are linear and time-invariant. 

2. The design procedure uses the desired passband, transition bands, passband ripple, and stopband attenuation. We 
do not need to know the sample values of the signals to be processed. 

3. Since the filters are frequency-selective, they work best when the various components of the input signal occupy 
nonoverlapping frequency bands. For example, it is easy to separate a signal and additive noise when their spectra 
do not overlap. 

4. The filter coefficients are chosen during the design phase and are held constant during the normal operation of the 
filter. 


However, there are many practical application problems that cannot be successfully solved by using fixed digital 
filters because either we do not have sufficient information to design a digital filter with fixed coefficients or the 
design criteria change during the normal operation of the filter. Most of these applications can be successfully solved 
by using special “smart” filters known collectively as adaptive filters. The distinguishing feature of adaptive filters is 
that they can modify their response to improve performance during operation without any intervention from the user. 


1.4.1 Applications of Adaptive Filters 


The best way to introduce the concept of adaptive filtering is by describing some typical application problems 
that can be effectively solved by using an adaptive filter. The applications of adaptive filters can be sorted for 
convenience into four classes: (1) system identification, (2) system inversion, (3) signal prediction, and (4) 
multisensor interference cancelation (see Figure 1.10 and Table 1.1). We next describe each class of applications and
provide a typical example for each case. 


TABLE 1.1
Classification of adaptive filtering applications.

Application class                       Examples

System identification                   Echo cancelation
                                        Adaptive control
                                        Channel modeling
System inversion                        Adaptive equalization
                                        Blind deconvolution
Signal prediction                       Adaptive predictive coding
                                        Change detection
                                        Radio frequency interference cancelation
Multisensor interference cancelation    Acoustic noise control
                                        Adaptive beamforming


System Identification


This class of applications, known also as system modeling, is illustrated in Figure 1.10 (a). The system to be 
modeled can be either real, as in control system applications, or some hypothetical signal transmission path (e.g., the 
echo path). The distinguishing characteristic of the system identification application is that the input of the adaptive 
filter is noise-free and the desired response is corrupted by additive noise that is uncorrelated with the input signal. 
Applications in this class include echo cancelation, channel modeling, and identification of systems for control 
applications (Gitlin et al. 1992; Ljung 1987; Åström and Wittenmark 1990). In control applications, the purpose of 
the adaptive filter is to estimate the parameters or the state of the system and then to use this information to design a 
controller. In signal processing applications, the goal is to obtain a good estimate of the desired response according to 
the adopted criterion of performance. 

FIGURE 1.10 
The four basic classes of adaptive filtering applications: (a) system identification, (b) system inversion, (c) signal prediction, and (d) 
multisensor interference cancelation. 
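
The system identification setup can be illustrated with a small simulation. The sketch below is only an assumption-laden example: the text has not yet specified any adaptive algorithm, so a least-mean-square (LMS) style coefficient update is used as a stand-in, and the "unknown" system `h_true`, the step size, and the noise level are hypothetical values chosen for illustration. The input is observed without noise, and the desired response is the system output plus additive noise that is uncorrelated with the input.

```python
import random


def lms_identify(x, d, num_taps, step_size):
    """Adapt FIR coefficients so that the filter output tracks the desired response d."""
    w = [0.0] * num_taps
    for n in range(len(x)):
        # Newest-first window of the input; samples before n = 0 are taken as zero.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))       # adaptive filter output
        e = d[n] - y                                   # error against the desired response
        w = [wk + step_size * e * uk for wk, uk in zip(w, u)]
    return w


rng = random.Random(0)
h_true = [0.8, -0.4, 0.2]                              # hypothetical system to be modeled
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]         # input, observed without noise
d = []
for n in range(len(x)):
    system_output = sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
    d.append(system_output + rng.gauss(0.0, 0.01))     # noise uncorrelated with the input

w_hat = lms_identify(x, d, num_taps=3, step_size=0.01)
print([round(w, 3) for w in w_hat])                    # should be close to h_true
```

With these settings the estimated coefficients should settle near `h_true`; a larger step size adapts faster at the cost of noisier estimates.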


Acoustic echo cancelation. Figure 1.11 shows a typical audio teleconferencing system that helps two groups of 
people, located at two different places, to communicate effectively. However, the performance of this system is 
degraded by the following effects: (1) The reverberations of the room result from the fact that the microphone picks 
up not only the speech coming from the talker but also reflections from the walls and furniture in the room. (2) 
Echoes are created by the acoustic coupling between the microphone and the loudspeaker located in the same room. 
Speech from room B not only is heard by the listener in room A but also is picked up by the microphone in room A, 
and, unless it is prevented, will return as an echo to the speaker in room B. 


Several methods to deal with acoustic echoes have been developed. However, the most effective technique to 
prevent or control echoes is adaptive echo cancelation. The basic idea is very simple: To cancel the echo, we generate 
a replica or pseudo-echo and then subtract it from the real echo. To synthesize the echo replica, we pass the signal at 
the loudspeaker through a device designed to duplicate the reverberation and echo properties of the room (echo path), 
as is illustrated in Figure 1.12. 

FIGURE 1.11 
Typical teleconferencing system without echo control. 

FIGURE 1.12 
Principle of acoustic echo cancelation using an adaptive echo canceler. 


In practice, there are two obstacles to this approach. (1) The echo path is usually unknown before actual 
transmission begins and is quite complex to model. (2) The echo path changes with time, since even the movement of a 
talker alters the acoustic properties of the room. Therefore, we cannot design and use a fixed echo canceler with 
satisfactory performance for all possible connections. There are two possible ways around this problem: 

1. Design a compromise fixed echo canceler based on some “average” echo path, assuming that we have sufficient 
information about the connections to be seen by the canceler. 

2. Design an adaptive echo canceler that can “learn” the echo path when it is first turned on and afterward “track” its 
variations without any intervention from the designer. Since an adaptive canceler matches the echo path for any 
given connection, it performs better than a fixed compromise canceler. 


We stress that the main task of the canceler is to estimate the echo signal with sufficient accuracy; the estimation 
of the echo path is simply the means for achieving this goal. The performance of the canceler is measured by the 
attenuation of the echo. The adaptive echo canceler achieves this goal by modifying its response, using the residual 
echo signal in an as-yet-unspecified way. More details about acoustic echo cancelation can be found in Gilloire et al. 
(1996). 
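
The "learn, then track" behavior and the attenuation measure can be sketched numerically. The example below is an illustration under stated assumptions, not the method of the text: it uses a normalized LMS update (the text leaves the adaptation rule unspecified at this point), two short hypothetical echo paths `path_before` and `path_after` to mimic a talker moving halfway through the call, and a small amount of background noise at the microphone. The attenuation is computed as the ratio of echo power to residual power, in decibels.

```python
import math
import random


def nlms_echo_canceler(loudspeaker, microphone, num_taps, step_size=0.5, eps=1e-6):
    """Return the residual echo after subtracting an adaptively synthesized replica."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps                      # most recent loudspeaker samples, newest first
    residual = []
    for x_n, m_n in zip(loudspeaker, microphone):
        buf = [x_n] + buf[:-1]
        replica = sum(wk * bk for wk, bk in zip(w, buf))   # pseudo-echo
        e = m_n - replica                                   # residual echo
        norm = eps + sum(b * b for b in buf)
        w = [wk + step_size * e * bk / norm for wk, bk in zip(w, buf)]
        residual.append(e)
    return residual


def power(signal):
    return sum(s * s for s in signal) / len(signal)


rng = random.Random(1)
loudspeaker = [rng.gauss(0.0, 1.0) for _ in range(4000)]
path_before = [0.6, 0.3, -0.2, 0.1]             # hypothetical echo path
path_after = [0.5, -0.3, 0.25, 0.05]            # hypothetical path after the talker moves
microphone = []
for n in range(len(loudspeaker)):
    path = path_before if n < 2000 else path_after
    echo = sum(path[k] * loudspeaker[n - k] for k in range(len(path)) if n - k >= 0)
    microphone.append(echo + rng.gauss(0.0, 0.001))        # weak background noise

residual = nlms_echo_canceler(loudspeaker, microphone, num_taps=4)
attenuation_db = 10.0 * math.log10(power(microphone[-500:]) / power(residual[-500:]))
print(round(attenuation_db, 1))
```

The attenuation is measured over the last 500 samples, well after the path change, so a large value indicates that the canceler has re-adapted to the new path.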


System inversion 


This class of applications, which is illustrated in Figure 1.10 (b), is also known as inverse system modeling. The 
goal of the adaptive filter is to estimate and apply the inverse of the system. Depending on the application, the input 
of the adaptive filter may be corrupted by additive noise, and the desired response may not be available. The 
existence of the inverse system and its properties (e.g., causality and stability) creates additional complications. 
Typical applications include adaptive equalization (Gitlin et al. 1992), seismic deconvolution (Robinson 1984), and 
adaptive inverse control (Widrow and Walach 1994). 


Channel equalization. To understand the basic principles of the channel equalization techniques, we consider a 
binary data communication system that transmits a band-limited analog pulse with amplitudes $A$ (symbol 1) or $-A$ 
(symbol 0) every $T_B$ s (see Figure 1.13). Here $T_B$ is known as the symbol interval and $R_B = 1/T_B$ as the baud rate. 
As the signal propagates through the channel, it is delayed and attenuated in a frequency-dependent manner. 
Furthermore, it is corrupted by additive noise and other natural or man-made interferences. The goal of the receiver is 
to measure the amplitude of each arriving pulse and to determine which one of the two possible pulses has been sent. 
The received signal is sampled once per symbol interval after filtering, automatic gain control, and carrier removal. 
The sampling time is adjusted to coincide with the “center” of the received pulse. The shape of the pulse is chosen to 
attain the maximum rate at which the receiver can still distinguish the different pulses. To achieve this goal, we 
usually choose a band-limited pulse that has periodic zero crossings every $T_B$ s. 

FIGURE 1.13 
Simple model of a digital communications system. 

If the periodic zero crossings of the pulse are preserved after transmission and reception, we can measure its 
amplitude without interference from overlapping adjacent pulses. However, channels that deviate from the ideal 
response (constant magnitude and linear phase) destroy the periodic zero-crossing property and the shape of the peak 
of the pulse. As a result, the tails of adjacent pulses interfere with the measurement of the current pulse and can lead 
to an incorrect decision. This type of degradation, which is known as intersymbol interference (ISI), is illustrated in 
Figure 1.14. 

FIGURE 1.14 
Pulse trains (a) without intersymbol interference and (b) with intersymbol interference. 
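
A small numerical example makes the effect of ISI on the decisions concrete. The sketch below works directly with symbol-rate samples of the overall received pulse: `ideal_pulse` keeps the periodic zero crossings (only the current symbol contributes to each sample), whereas `distorted_pulse` is a hypothetical response whose tails leak into the next two symbol intervals. The bit pattern and the tail values are arbitrary choices made for illustration.

```python
def symbol_rate_samples(symbols, pulse_samples):
    """One receiver sample per symbol: the current pulse peak plus tails of earlier pulses."""
    return [
        sum(pulse_samples[k] * symbols[n - k] for k in range(len(pulse_samples)) if n - k >= 0)
        for n in range(len(symbols))
    ]


def decide(samples):
    """Threshold detector: positive sample means symbol 1, otherwise symbol 0."""
    return [1 if s > 0 else 0 for s in samples]


A = 1.0
bits = [1, 0, 1, 1, 0, 0, 1, 0]
symbols = [A if b == 1 else -A for b in bits]

ideal_pulse = [1.0]                 # zero crossings preserved at the sampling instants
distorted_pulse = [1.0, 0.7, 0.4]   # hypothetical response with tails at the next two instants

bits_ideal = decide(symbol_rate_samples(symbols, ideal_pulse))
bits_isi = decide(symbol_rate_samples(symbols, distorted_pulse))
errors_ideal = sum(b != bh for b, bh in zip(bits, bits_ideal))
errors_isi = sum(b != bh for b, bh in zip(bits, bits_isi))
print(errors_ideal, errors_isi)     # prints: 0 2
```

The two errors occur where both preceding symbols have the opposite sign of the current one, so that their combined tails (0.7 + 0.4) outweigh the current pulse peak.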


We can compensate for the ISI distortion by using a linear filter called an equalizer. The goal of the equalizer is 
to restore the received pulse, as closely as possible, to its original shape. The equalizer transforms the channel to a 
near-ideal one if its response resembles the inverse of the channel. Since the channel is unknown and possibly 
time-varying, there are two ways to approach the problem: (1) Design a fixed compromise equalizer to obtain 
satisfactory performance over a broad range of channels, or (2) design an equalizer that can “learn” the inverse of the 
particular channel and then “track” its variation in real time. 

The characteristics of the equalizer are adjusted by some algorithm that attempts to attain the best possible 
performance. The most appropriate criterion of performance for data transmission systems is the probability of 
symbol error. However, it cannot be used for two reasons: (1) The “correct” symbol is unknown to the receiver 
(otherwise there would be no reason to communicate), and (2) the number of decisions (observations) needed to 
estimate the low probabilities of error is extremely large. Thus, practical equalizers assess their performance by using 
some function of the difference between the “correct” symbol and the output. Practical equalizers operate in two 
modes, depending on how we substitute for the unavailable correct symbol sequence. (1) A 
known training sequence is transmitted, and the equalizer attempts to improve its performance by comparing its 
output to a synchronized replica of the training sequence stored at the receiver. Usually this mode is used when the 
equalizer starts a transmission session. (2) At the end of the training session, when the equalizer starts making reliable 
decisions, we can replace the training sequence with the equalizer’s own decisions. 
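
The two modes of operation can be sketched as a single loop in which only the reference signal changes. This is an illustrative example under assumptions that are not part of the text: the adaptation rule is an LMS-type update, the channel is a hypothetical three-tap response, and the equalizer length, step size, noise level, and training length are arbitrary. For comparison, the sketch also counts the errors made by thresholding the unequalized received samples.

```python
import random


def sign(v):
    return 1.0 if v >= 0 else -1.0


rng = random.Random(2)
channel = [1.0, 0.7, 0.4]                      # hypothetical dispersive channel
num_train, num_total = 2000, 5000
symbols = [rng.choice((-1.0, 1.0)) for _ in range(num_total)]
received = []
for n in range(num_total):
    v = sum(channel[k] * symbols[n - k] for k in range(len(channel)) if n - k >= 0)
    received.append(v + rng.gauss(0.0, 0.02))  # additive channel noise

num_taps, step_size = 11, 0.01
w = [0.0] * num_taps
buf = [0.0] * num_taps                         # newest received sample first
dd_errors = 0                                  # decision errors after training
raw_errors = 0                                 # errors without equalization, same symbols
for n in range(num_total):
    buf = [received[n]] + buf[:-1]
    y = sum(wk * bk for wk, bk in zip(w, buf))
    decision = sign(y)
    if n < num_train:
        reference = symbols[n]                 # mode 1: stored replica of the training sequence
    else:
        reference = decision                   # mode 2: the equalizer's own decisions
        dd_errors += int(decision != symbols[n])
        raw_errors += int(sign(received[n]) != symbols[n])
    e = reference - y
    w = [wk + step_size * e * bk for wk, bk in zip(w, buf)]

print(dd_errors, raw_errors)
```

Switching to decision-directed operation is only safe once the decisions are reliable; if the training period is too short, wrong decisions are fed back as the reference and adaptation can fail.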

Adaptive equalization is a mature technology that has had the greatest impact on digital communications 
systems, including voiceband, microwave and troposcatter radio, and cable TV modems (Qureshi 1985; Lee and 
Messerschmitt 1994; Gitlin et al. 1992; Bingham 1988; Treichler et al. 1996). 


Signal prediction 


In the next class of applications, the goal is to estimate the value $x(n_0)$ of a random signal by using a set of 
consecutive signal samples $\{x(n),\ n_1 \le n \le n_2\}$. There are three cases of interest: (1) forward prediction, when 
$n_0 > n_2$; (2) backward “prediction,” when $n_0 < n_1$; and (3) smoothing or interpolation, when $n_1 < n_0 < n_2$. Clearly, 
in the last case the value at $n = n_0$ is not used in the computation of the estimate. The most widely used type is 
forward linear prediction or simply linear prediction [see Figure 1.10(c)], where the estimate is formed by using a 
linear combination of past samples (Makhoul 1975). 
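
The three cases depend only on where the target index lies relative to the observation interval. The helper below is a hypothetical illustration (the function name and return strings are not from the text) that classifies a given triple of indices.

```python
def estimation_case(n0, n1, n2):
    """Classify the estimation of x(n0) from the samples x(n), n1 <= n <= n2."""
    if n1 > n2:
        raise ValueError("expected n1 <= n2")
    if n0 > n2:
        return "forward prediction"
    if n0 < n1:
        return "backward prediction"
    if n1 < n0 < n2:
        return "smoothing (interpolation)"
    # n0 equal to n1 or n2 is not one of the three cases listed in the text.
    return "not covered: n0 coincides with an end of the observation interval"


print(estimation_case(10, 0, 9))    # target lies after the observed samples
print(estimation_case(-1, 0, 9))    # target lies before the observed samples
print(estimation_case(5, 0, 9))     # target lies inside the observation interval
```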


Linear predictive coding (LPC). The efficient storage and transmission of analog signals using digital systems 
requires the minimization of the number of bits necessary to represent the signal while maintaining the quality to an 
acceptable level according to a certain criterion of performance. The conversion of an analog (continuous-time, 
continuous-amplitude) signal to a digital (discrete-time, discrete-amplitude) signal involves two processes: sampling 
and quantization. Sampling converts a continuous-time signal to a discrete-time signal by measuring its amplitude at 
equidistant intervals of time. Quantization involves the representation of the measured continuous amplitude using a 
finite number of symbols and always creates some amount of distortion (quantization noise). 

For a fixed number of bits, decreasing the dynamic range of the signal (and therefore the range of the quantizer) 
decreases the required quantization step and therefore the average quantization error power. Therefore, we can 
decrease the quantization noise by reducing the dynamic range or equivalently the variance of the signal. If the signal 
samples are significantly correlated, the variance of the difference between adjacent samples is smaller than the 
variance of the original signal. Thus, we can improve quality by quantizing this difference instead of the original 
signal. This idea is exploited by the linear prediction system shown in Figure 1.15. This system uses a linear predictor 
to form an estimate (prediction) $\hat{x}(n)$ of the present sample $x(n)$ as a linear combination of the $M$ past samples, 
that is, 

$$\hat{x}(n) = \sum_{k=1}^{M} a_k\, x(n-k) \tag{1.4.1}$$

The coefficients $\{a_k\}_{1}^{M}$ of the linear predictor are determined by exploiting the correlation between adjacent 
samples of the input signal with the objective of making the prediction error 

$$e(n) = x(n) - \hat{x}(n) \tag{1.4.2}$$

as small as possible. If the prediction is good, the dynamic range of e(n) should be smaller than the dynamic range of 
x(n), resulting in a smaller quantization noise for the same number of bits or the same quantization noise with a 
smaller number of bits. The performance of the LPC system depends on the accuracy of the predictor. Since the 
statistical properties of the signal x(n) are unknown and change with time, we cannot design an optimum fixed 
predictor. The established practical solution is to use an adaptive linear predictor that automatically adjusts its 
coefficients to compute a “good” prediction at each time instant. A detailed discussion of adaptive linear prediction 
and its application to audio, speech, and video signal coding is provided in Jayant and Noll (1984). 

FIGURE 1.15 
Illustration of the linear prediction of a signal x(n) using a finite number of past samples. 
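
The variance-reduction argument behind (1.4.1) and (1.4.2) can be checked on a synthetic correlated signal. The sketch below is an illustration under assumptions: the signal is a hypothetical first-order autoregressive sequence, the predictor has a single coefficient ($M = 1$) chosen to minimize the sum of squared prediction errors over the record, and the "bits saved" figure assumes a uniform quantizer whose range scales with the standard deviation of its input, so that the noise power for $B$ bits is proportional to the input variance divided by $4^B$.

```python
import math
import random


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


rng = random.Random(3)
x = [0.0]
for _ in range(20000):
    x.append(0.9 * x[-1] + rng.gauss(0.0, 1.0))   # strongly correlated adjacent samples

# One-tap predictor: a1 minimizes the sum of e(n)^2 over the available record.
num = sum(x[n] * x[n - 1] for n in range(1, len(x)))
den = sum(x[n - 1] ** 2 for n in range(1, len(x)))
a1 = num / den

e = [x[n] - a1 * x[n - 1] for n in range(1, len(x))]   # prediction error, as in (1.4.2)
gain = variance(x) / variance(e)                       # variance reduction factor
bits_saved = 0.5 * math.log2(gain)                     # under the quantizer assumption above
print(round(a1, 3), round(gain, 2), round(bits_saved, 2))
```

A fixed coefficient works here only because the synthetic signal is stationary; for signals whose statistics change with time, the coefficient would have to be re-estimated continually.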


Multisensor interference cancelation 


The key feature of this class of applications is the use of multiple sensors to remove undesired interference and 
noise. Typically, a primary signal contains both the signal of interest and the interference. Other signals, known as 
reference signals, are available for the purposes of canceling the undesired interference [see Figure 1.10 (d)]. These 
reference signals are collected using other sensors in which the signal of interest is not present or is so weak that it 
can be ignored. The amount of correlation between the primary and reference signals is measured and used to form an 
estimate of the interference in the primary signal, which is subsequently removed. Had the signal of interest been 
present in the reference signal(s), then this process would have resulted in the removal of the desired signal as well. 
Typical applications in which interference cancelation is employed include array processing for radar and 
communications, biomedical sensing systems, and active noise control (Widrow et al. 1975; Kuo and Morgan 1996). 


Active noise control (ANC). The basic idea behind an ANC system is the cancelation of acoustic noise using 
destructive wave interference. To create destructive interference that cancels an acoustic noise wave (primary) at a 
point P, we can use a loudspeaker that creates, at the same point P, another wave (secondary) with the same 
frequency, the same amplitude, and 180° phase difference. Therefore, with appropriate control of the peaks and 
troughs of the secondary wave, we can produce zones of destructive interference (quietness). ANC systems using 
digital signal processing technology find applications in air-conditioning ducts, aircraft, cars, and magnetic resonance 
imaging (MRI) systems (Elliott and Nelson 1993; Kuo and Morgan 1996). 
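
How precisely the secondary wave must match the primary one can be estimated with a phasor sum. The sketch below assumes a single-frequency primary wave of unit amplitude at point P and a secondary wave with relative amplitude `amplitude_ratio` and a phase of 180° plus `phase_error_deg`; both parameter names are hypothetical. The residual amplitude is the magnitude of the sum of the two phasors.

```python
import math


def residual_amplitude(amplitude_ratio, phase_error_deg):
    """Amplitude of primary + secondary at P; the primary wave has unit amplitude."""
    phi = math.radians(180.0 + phase_error_deg)
    # Phasor sum 1 + g * exp(j * phi), split into real and imaginary parts.
    real = 1.0 + amplitude_ratio * math.cos(phi)
    imag = amplitude_ratio * math.sin(phi)
    return math.hypot(real, imag)


perfect = residual_amplitude(1.0, 0.0)                    # exact match: essentially zero
with_phase_error = residual_amplitude(1.0, 5.0)           # 5 degree phase mismatch
attenuation_db = -20.0 * math.log10(with_phase_error)     # roughly 21 dB
print(perfect, round(with_phase_error, 4), round(attenuation_db, 1))
```

Even a 5° phase error limits the attenuation to roughly 21 dB, which is why the accuracy of the secondary wave matters so much in practice.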

Figure 1.16 shows the key components of an adaptive ANC system described in Crawford et al. (1997). The task 
of the loudspeaker is to generate an acoustic wave that is a 180° phase-inverted version of the signal $y(t)$ when it 
arrives at the error microphone. In this case the error signal $e(t) = y(t) + \hat{y}(t) = 0$, and we create a “quiet zone” 
around the microphone. If the acoustic paths (1) from the noise source to the reference microphone ($G_x$), (2) from 
the noise source to the error microphone ($G_y$), (3) from the secondary loudspeaker to the reference microphone 
($H_x$), and (4) from the secondary loudspeaker to the error microphone ($H_{\hat{y}}$) are linear, time-invariant, and known, 
we can design a linear filter $H$ such that $e(n) = 0$. For example, if the effects of $H_x$ and $H_{\hat{y}}$ are negligible, the 
filter $H$ should invert $G_x$ to obtain $v(t)$ and then replicate $G_y$ to synthesize $\hat{y}(t) \simeq -y(t)$. The quality of 
cancelation depends on the accuracy of these two modeling processes. 

FIGURE 1.16 
Block diagram of the basic components of an active noise control system. 


In practice, the acoustic environment is unknown and time-varying. Therefore, we cannot design a fixed ANC 
filter with satisfactory performance. The only feasible solution is to use an adaptive filter with the capacity to identify 
and track the variation of the various acoustic paths and the spectral characteristics of the noise source in real time. 
The adaptive ANC filter adjusts its characteristics by trying to minimize the energy of the error signal e(n). 
Adaptive ANC using digital signal processing technology is an active area of research, and despite several successes 
many problems remain to be solved before such systems find their way to more practical applications (Crawford et al. 
1997). 


1.4.2 Features of Adaptive Filters 


Careful inspection of the applications discussed in the previous section indicates that every adaptive filter consists of 
the following three modules (see Figure 1.17). 

1. Filtering structure. This module forms the output of the filter using measurements of the input signal or signals. 
The filtering structure is linear if the output is obtained as a linear combination of the input measurements; 
otherwise, it is said to be nonlinear. The structure is fixed by the designer, and its parameters are adjusted by the 
adaptive algorithm. 

2. Criterion of performance (COP). The output of the adaptive filter and the desired response (when available) are 
processed by the COP module to assess its quality with respect to the requirements of the particular application. 

3. Adaptive algorithm. The adaptive algorithm uses the value of the criterion of performance, or some function of it, 
and the measurements of the input and desired response (when available) to decide how to modify the parameters 
of the filter to improve its performance. 

FIGURE 1.17 
Basic elements of a general adaptive filter. 

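
The division into three modules can be mirrored directly in code. The class below is a minimal sketch, not a prescribed design: the filtering structure is assumed to be a linear combiner, the criterion of performance is assumed to be the instantaneous squared error, and the adaptive algorithm is assumed to be an LMS-type correction. The class and method names are hypothetical. The usage example learns a single gain from a deterministic input so that the outcome is easy to verify by hand.

```python
class AdaptiveFilter:
    def __init__(self, num_taps, step_size):
        self.w = [0.0] * num_taps           # parameters adjusted by the adaptive algorithm
        self.step_size = step_size

    def filter(self, u):
        """1. Filtering structure: linear combination of the input measurements."""
        return sum(wk * uk for wk, uk in zip(self.w, u))

    @staticmethod
    def criterion(y, d):
        """2. Criterion of performance: squared difference from the desired response."""
        return (d - y) ** 2

    def adapt(self, u, y, d):
        """3. Adaptive algorithm: move the parameters so as to reduce the squared error."""
        e = d - y
        self.w = [wk + self.step_size * e * uk for wk, uk in zip(self.w, u)]


# Usage: the desired response is twice the input, so the single coefficient should approach 2.
filt = AdaptiveFilter(num_taps=1, step_size=0.1)
costs = []
for n in range(200):
    u = [1.0 if n % 2 == 0 else -1.0]
    d = 2.0 * u[0]
    y = filt.filter(u)
    costs.append(filt.criterion(y, d))
    filt.adapt(u, y, d)

print(round(filt.w[0], 6), costs[0], costs[-1])
```

Keeping the three modules separate makes it possible to change the criterion or the update rule without touching the filtering structure.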
Every adaptive filtering application involves one or more input signals and a desired response signal that may or 
may not be accessible to the adaptive filter. We collectively refer to these relevant signals as the signal operating 
environment (SOE) of the adaptive filter. The design of any adaptive filter requires a great deal of a priori 
information about the SOE and a deep understanding of the particular application (Claasen and Mecklenbrauker 
1985). This information is needed by the designer to choose the filtering structure and the criterion of performance 
and to design the adaptive algorithm. To be more specific, adaptive filters are designed for a specific type of input 
signal (speech, binary data, etc.), for specific types of interferences (additive white noise, sinusoidal signals, echoes of 
the input signals, etc.), and for specific types of signal transmission paths (e.g., linear time-invariant or time-varying). 
After the proper design decisions have been made, the only unknowns, when the adaptive filter starts its operation, 
are a set of parameters that are to be determined by the adaptive algorithm using signal measurements. Clearly, 
unreliable a priori information and/or incorrect assumptions about the SOE can lead to serious performance 
degradations or even unsuccessful adaptive filter applications. 

If the characteristics of the relevant signals are constant, the goal of the adaptive filter is to find the parameters 
that give the best performance and then to stop the adjustment. However, when the characteristics of the relevant 
signals change with time, the adaptive filter should first find and then continuously readjust its parameters to track 
these changes. 

A very influential factor in the design of adaptive algorithms is the availability of a desired response signal. We 
have seen that for certain applications, the desired response may not be available for use by the adaptive filter. In this 
book we focus on supervised adaptive filters that require the use of a desired response signal and we simply call them 
adaptive filters. 

Suppose now that the relevant signals can be modeled by stochastic processes with known statistical properties. 
If we adopt the minimum mean square error as a criterion of performance, we can design, at least in principle, an 
optimum filter that provides the ultimate solution. From a theoretical point of view, the goal of the adaptive filter is to 
replicate the performance of the optimum filter without the benefit of knowing and using the exact statistical 
properties of the relevant signals. In this sense, the theory of optimum filters (see Chapters 5 and 6) is a prerequisite 
for the understanding, design, performance evaluation, and successful application of adaptive filters. 


1.5 Organization of the Book 


In this section we provide an overview of the main topics covered in the book so as to help the reader navigate 
through the material and understand the interdependence among the various chapters (see Figure 1.18). 

In Chapter 2, we elaborate on certain topics that are crucial to developments in subsequent chapters. Reading 
this chapter is essential to familiarize the reader with notation and properties that are repeatedly used throughout the 
rest of the book. Chapter 4 presents the most practical methods for nonparametric estimation of correlation and 
spectral densities. The use of these techniques for exploratory investigation of the relevant signal characteristics 
before performing any modeling or adaptive filtering is invaluable. 

FIGURE 1.18 
Flowchart organization of the book’s chapters. 


Chapters 3 and 5 provide a detailed study of the theoretical properties of signal models and optimum filters, 
assuming that the relevant signals can be modeled by stochastic processes with known statistical properties. In 
Chapter 6, we develop algorithms and structures for optimum filtering and signal modeling and prediction. 

Chapter 7 introduces the general method of least squares and shows how to use it for the design of filters and 
predictors from actual signal observations. The statistical properties and the numerical computation of least-squares 
estimates are also discussed in detail. 

Chapters 8 and 9 use the theoretical work in Chapters 3, 5, and 6 and the practical methods in Chapter 7 to 
develop, evaluate, and apply practical techniques for signal modeling and adaptive filtering. 


CHAPTER 2 


Random Sequences 


Deterministic signals are signals whose amplitude is uniquely specified by a mathematical formula or rule. 
However, there are many important examples of signals whose precise description (i.e., as deterministic signals) is 
extremely difficult, if not impossible. Although random signals evolve in time in an unpredictable manner, their 
average properties can often be assumed to be deterministic; that is, they can be specified by explicit mathematical 
formulas. This is the key for the modeling of a random signal as a stochastic process. 

Our aim in the subsequent discussions is to present some basic results from discrete-time stochastic processes 
that will be useful in the chapters that follow. We assume that most readers have some basic knowledge of these 
topics, and so parts of this chapter may be treated as a review exercise. However, some specific topics are developed 
in greater depth with a viewpoint that will serve as a foundation for the rest of the book. A more complete treatment 
can be found in Papoulis (1991), Helstrom (1992), and Stark and Woods (1994). 


2.1 Discrete-Time Stochastic Processes 


Many natural sequences can be characterized as random signals because we cannot determine their values precisely, 
that is, they are unpredictable. A natural mathematical framework for the description of these discrete-time random 
signals is provided by discrete-time stochastic processes. 

To obtain a formal definition, consider an experiment with a finite or infinite number of unpredictable outcomes 
from a sample space $S = \{\zeta_1, \zeta_2, \ldots\}$, each occurring with a probability $\Pr\{\zeta_k\}$, $k = 1, 2, \ldots$. By some rule we 
assign to each element $\zeta_k$ of $S$ a deterministic sequence $x(n, \zeta_k)$, $-\infty < n < \infty$. The sample space $S$, the 
probabilities $\Pr\{\zeta_k\}$, and the sequences $x(n, \zeta_k)$, $-\infty < n < \infty$, constitute a discrete-time stochastic process or 
random sequence. Formally, $x(n, \zeta)$, $-\infty < n < \infty$, is a random sequence if for a fixed value $n_0$ of $n$, $x(n_0, \zeta)$ is a random 
variable. 

The set of all possible sequences $\{x(n, \zeta)\}$ is called an ensemble, and each individual sequence $x(n, \zeta_k)$, 
corresponding to a specific value of $\zeta = \zeta_k$, is called a realization or a sample sequence of the ensemble. 

There are four possible interpretations of $x(n, \zeta)$, depending on the character of $n$ and $\zeta$, as illustrated in 
Figure 2.1. 

• $x(n, \zeta)$ is a random variable if $n$ is fixed and $\zeta$ is a variable. 
• $x(n, \zeta)$ is a sample sequence if $\zeta$ is fixed and $n$ is a variable. 
• $x(n, \zeta)$ is a number if both $n$ and $\zeta$ are fixed. 
• $x(n, \zeta)$ is a stochastic process if both $n$ and $\zeta$ are variables. 
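
The four interpretations above correspond to four ways of indexing a table whose rows are realizations and whose columns are time instants. The sketch below builds a small finite ensemble under assumptions made only for illustration: the rule assigning a sequence to each outcome is a sinusoid with a randomly drawn phase, and the numbers of outcomes and samples are arbitrary.

```python
import math
import random

rng = random.Random(4)
num_outcomes, num_samples = 5, 8

# Hypothetical rule: outcome zeta_k selects a phase, which fixes an entire sequence.
phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_outcomes)]
ensemble = [[math.cos(0.3 * n + phi) for n in range(num_samples)] for phi in phases]

n0, k = 3, 1
random_variable = [realization[n0] for realization in ensemble]  # n fixed, zeta varies
sample_sequence = ensemble[k]                                    # zeta fixed, n varies
number = ensemble[k][n0]                                         # both n and zeta fixed
stochastic_process = ensemble                                    # both n and zeta vary

print(len(random_variable), len(sample_sequence), number)
```

In practice only one row of such a table is usually observed, which is why the relation between ensemble properties and the properties of a single realization matters.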

A random sequence is also called a time series in the statistics literature. It is a sequence of random variables, or 
it can be thought of as an infinite-dimensional random vector. As with any collection of infinite objects, one has to be 
careful with the asymptotic (or convergence) properties of a random sequence. If $n$ is a continuous variable taking 
values in $\mathbb{R}$, then $x(n, \zeta)$ is an uncountable collection of random variables or an ensemble of waveforms. This 
ensemble is called a continuous-time stochastic process or a random process. Although these processes can be 
handled similarly to sequences, they are more difficult to deal with in a rigorous mathematical manner than sequences 
are. Furthermore, practical signal processing requires discrete-time signals. Hence in this book we consider random 
sequences rather than random waveforms. 

Finally, in passing we note that the word stochastic is derived from the Greek word stochasticos, which means 
skillful in aiming or guessing. Hence, the terms random process and stochastic process will be used interchangeably 
throughout this book. 

A deterministic signal is by definition exactly predictable. This assumes that there exists a certain functional 
relationship that completely describes the signal, even if this relationship is not available. The unpredictability of a 
random process is, in general, the combined result of two things. First, the selection of a single realization is based on 
the outcome of a random experiment. Second, no functional description is available for all 
realizations of the ensemble. However, in some special cases, such a functional relationship is available. This means 
that after the occurrence of a specific realization, its future values can be predicted exactly from its past ones. If the 
future samples of any realization of a stochastic process can be predicted from the past ones, the process is called 
predictable or deterministic; otherwise, it is said to be a regular process. For example, the process $x(n, \zeta) = c$, 
where $c$ is a random variable, is a predictable stochastic process because every realization is a discrete-time signal 
with constant amplitude. In practice, we most often deal with regular stochastic processes. 

FIGURE 2.1 
Graphical description of random sequences. 

The simplest description of any random signal is provided by an amplitude-versus-time plot. Inspection of this 
plot provides qualitative information about some significant features of the signal that are useful in many applications. 
These features include, among others, the following: 

1. The frequency of occurrence of various signal amplitudes, described by the probability distribution of samples. 

2. The degree of dependence between two signal samples, described by the correlation between them. 

3. The existence of “cycles” or quasi-periodic patterns, obtained from the signal power spectrum (which will be 
described in Section 2.1.6). 

4. Indications of variability in the mean, variance, probability density, or spectral content. 

The first feature above, the amplitude distribution, is obtained by plotting the histogram, which is an estimate of 
the first-order probability density of the underlying stochastic process. The probability density indicates waveform 
features such as “spikiness” and boundedness. Its form is crucial in the design of reliable estimators, quantizers, and 
event detectors. 
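
A histogram can be turned into a density estimate by dividing the bin counts by the number of samples times the bin width. The sketch below is a minimal illustration using synthetic Gaussian samples; the function name, the bin range, and the number of bins are arbitrary choices, and samples outside the range are simply ignored.

```python
import random


def histogram_density(samples, num_bins, lo, hi):
    """Estimate the first-order pdf on [lo, hi) from the relative frequency per bin."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for s in samples:
        if lo <= s < hi:
            index = min(int((s - lo) / width), num_bins - 1)   # guard against rounding
            counts[index] += 1
    # Dividing by N * width makes the bar areas sum to the fraction of samples in range.
    return [c / (len(samples) * width) for c in counts]


rng = random.Random(6)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
density = histogram_density(samples, num_bins=16, lo=-4.0, hi=4.0)

bin_width = 8.0 / 16
area = sum(density) * bin_width            # close to 1 when few samples fall out of range
peak_bin = density.index(max(density))     # for these samples, one of the two central bins
print(round(area, 4), peak_bin)
```

The choice of bin width trades resolution against statistical fluctuation: narrow bins reveal more detail but contain fewer samples each.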

The dependence between two signal samples (which are random variables) is given theoretically by the 
autocorrelation sequence and is quantified in practice by the empirical correlation, which is an estimate of the 
autocorrelation sequence of the underlying process. It affects the rate of amplitude change from sample to sample. 

Cycles in the data are related to sharp peaks in the power spectrum or periodicity in the autocorrelation. 
Although the power spectrum and the autocorrelation contain the same information, they present it in different 
fashions. 

Variability in a given quantity (e.g., variance) can be studied by evaluating this quantity for segments that can be 
assumed locally stationary and then analyzing the segment-to-segment variation. Such short-term descriptions should 
be distinguished from long-term ones, where the whole signal is analyzed as a single segment. 
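
The difference between short-term and long-term descriptions is easy to see on a signal whose variance changes partway through. The sketch below uses a synthetic signal (unit-variance noise followed by noise with standard deviation 3) and a hypothetical helper that evaluates the variance over consecutive nonoverlapping segments; the segment length is an arbitrary choice that is assumed short enough for each segment to be treated as locally stationary.

```python
import random


def segment_variances(x, segment_length):
    """Variance of each consecutive, nonoverlapping segment of the signal."""
    result = []
    for start in range(0, len(x) - segment_length + 1, segment_length):
        segment = x[start:start + segment_length]
        m = sum(segment) / segment_length
        result.append(sum((s - m) ** 2 for s in segment) / segment_length)
    return result


rng = random.Random(7)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [rng.gauss(0.0, 3.0) for _ in range(2000)]

short_term = segment_variances(x, 500)        # eight segment-by-segment values
long_term = segment_variances(x, len(x))[0]   # whole signal treated as a single segment
print([round(v, 2) for v in short_term], round(long_term, 2))
```

The long-term value lies between the two regimes and describes neither of them, whereas the segment-to-segment values expose the change.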

All the above features, to a lesser or greater extent, are interrelated. Therefore, it is impossible to point out 
exactly the effect of each one upon the visual appearance of the signal. However, a lot of insight can be gained by 
introducing the concepts of signal variability and signal memory, which are discussed in Sections 2.1.5 and 2.2.3, 
respectively. 


2.1.1 Description Using Probability Functions 


From Figure 2.1, it is clear that at $n = n_0$, $x(n_0, \zeta)$ is a random variable that requires a first-order probability 
function, say cdf $F_x(x; n_0)$, for its description. Similarly, $x(n_1, \zeta)$ and $x(n_2, \zeta)$ are joint random variables at 
instances $n_1$ and $n_2$, respectively, requiring a joint cdf $F_x(x_1, x_2; n_1, n_2)$. Stochastic processes contain infinitely many 
such random variables. Hence they are completely described, in a statistical sense, if their $k$th-order distribution 
function 

$$F_x(x_1, \ldots, x_k; n_1, \ldots, n_k) = \Pr\{x(n_1) \le x_1, \ldots, x(n_k) \le x_k\} \tag{2.1.1}$$

is known for every value of $k \ge 1$ and for all instances $n_1, n_2, \ldots, n_k$. The $k$th-order pdf is given by 


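$$f_x(x_1, \ldots, x_k; n_1, \ldots, n_k) \triangleq \frac{\partial^{2k} F_x(x_1, \ldots, x_k; n_1, \ldots, n_k)}{\partial x_{R1} \cdots \partial x_{Ik}}, \qquad k \ge 1 \tag{2.1.2}$$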


Clearly, the probabilistic description requires a lot of information that is difficult to obtain in practice except for 
simple stochastic processes. However, many (but not all) properties of a stochastic process can be described in terms 
of averages associated with its first- and second-order densities. 

For simplicity, in the rest of the book, we will use a compact notation $x(n)$ to represent either a random 
process $x(n, \zeta)$ or a single realization $x(n)$, which is a member of the ensemble. Thus we will drop the variable 
$\zeta$ from all notations involving random variables, vectors, or processes. We believe that this will not cause any 
confusion and that the exact meaning will be clear from the context. Also the random process x(n) is assumed to 
be complex-valued unless explicitly specified as real-valued. 


2.1.2 Second-Order Statistical Description 


The second-order statistic of $x(n)$ at time $n$ is specified by its mean value $\mu_x(n)$ and its variance $\sigma_x^2(n)$, 
defined by 

$$\mu_x(n) = E\{x(n)\} = E\{x_R(n) + j x_I(n)\} \tag{2.1.3}$$

and 

$$\sigma_x^2(n) = E\{|x(n) - \mu_x(n)|^2\} = E\{|x(n)|^2\} - |\mu_x(n)|^2 \tag{2.1.4}$$

respectively. Note that both $\mu_x(n)$ and $\sigma_x(n)$ are, in general, deterministic sequences. 
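
The second equality in the variance definition follows from expanding the squared magnitude and using the linearity of expectation. The short derivation below fills in that step; it uses only the definition of the mean and the fact that $\mu_x(n)$ is deterministic, so it can be taken outside the expectation.

```latex
\begin{aligned}
E\{|x(n) - \mu_x(n)|^2\}
  &= E\{[x(n) - \mu_x(n)][x(n) - \mu_x(n)]^*\} \\
  &= E\{|x(n)|^2\} - \mu_x^*(n)\,E\{x(n)\} - \mu_x(n)\,E\{x^*(n)\} + |\mu_x(n)|^2 \\
  &= E\{|x(n)|^2\} - |\mu_x(n)|^2 - |\mu_x(n)|^2 + |\mu_x(n)|^2 \\
  &= E\{|x(n)|^2\} - |\mu_x(n)|^2
\end{aligned}
```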

The second-order statistics of $x(n)$ at two different times $n_1$ and $n_2$ are given by the two-dimensional 
autocorrelation (or autocovariance) sequences. The autocorrelation sequence of a discrete-time random process is 
defined as the joint moment of the random variables $x(n_1)$ and $x(n_2)$, that is, 

$$r_{xx}(n_1, n_2) = E\{x(n_1)\, x^*(n_2)\} \tag{2.1.5}$$


It provides a measure of the dependence between values of the process at two different times. In this sense, it also 
provides information about the time variation of the process. The autocovariance sequence of x(n) is defined by 


$$\begin{aligned}
\gamma_{xx}(n_1,n_2) &= E\{[x(n_1)-\mu_x(n_1)][x(n_2)-\mu_x(n_2)]^*\} \\
&= r_{xx}(n_1,n_2) - \mu_x(n_1)\mu_x^*(n_2)
\end{aligned} \tag{2.1.6}$$

We will use notations such as $\gamma_x(n_1,n_2)$, $r_x(n_1,n_2)$, $\gamma(n_1,n_2)$, or $r(n_1,n_2)$ when there is no confusion as to which
signal we are referring. Note that, in general, the second-order statistics are defined on a two-dimensional grid of
integers.

The statistical relation between two stochastic processes x(n) and y(n) that are jointly distributed (i.e., they 
are defined on the same sample space S) can be described by their cross-correlation and cross-covariance functions, 
defined by 

$$r_{xy}(n_1,n_2) = E\{x(n_1)y^*(n_2)\} \tag{2.1.7}$$

$$\begin{aligned}
\gamma_{xy}(n_1,n_2) &= E\{[x(n_1)-\mu_x(n_1)][y(n_2)-\mu_y(n_2)]^*\} \\
&= r_{xy}(n_1,n_2) - \mu_x(n_1)\mu_y^*(n_2)
\end{aligned} \tag{2.1.8}$$

The normalized cross-correlation of two random processes $x(n)$ and $y(n)$ is defined by

$$\rho_{xy}(n_1,n_2) = \frac{\gamma_{xy}(n_1,n_2)}{\sigma_x(n_1)\sigma_y(n_2)} \tag{2.1.9}$$


Some definitions 


We now describe some useful types of stochastic processes based on their statistical properties. A random 
process is said to be 
• An independent process if

$$f_x(x_1,\ldots,x_k; n_1,\ldots,n_k) = f_1(x_1;n_1)\cdots f_k(x_k;n_k) \qquad \forall k,\ n_i,\ i=1,\ldots,k \tag{2.1.10}$$

that is, $x(n)$ is a sequence of independent random variables. If all random variables have the same pdf $f(x)$ for
all $k$, then $x(n)$ is called an IID (independent and identically distributed) random sequence.
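• An uncorrelated process if $x(n)$ is a sequence of uncorrelated random variables, that is,

$$\gamma_x(n_1,n_2) = \begin{cases} \sigma_x^2(n_1) & n_1 = n_2 \\ 0 & n_1 \neq n_2 \end{cases} \;=\; \sigma_x^2(n_1)\,\delta(n_1-n_2) \tag{2.1.11}$$

Alternatively, we have

$$r_x(n_1,n_2) = \begin{cases} \sigma_x^2(n_1) + |\mu_x(n_1)|^2 & n_1 = n_2 \\ \mu_x(n_1)\mu_x^*(n_2) & n_1 \neq n_2 \end{cases} \tag{2.1.12}$$

• An orthogonal process if it is a sequence of orthogonal random variables, that is,

$$r_x(n_1,n_2) = \begin{cases} \sigma_x^2(n_1) + |\mu_x(n_1)|^2 & n_1 = n_2 \\ 0 & n_1 \neq n_2 \end{cases} \;=\; E\{|x(n_1)|^2\}\,\delta(n_1-n_2) \tag{2.1.13}$$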





• An independent increment process if $\forall k > 1$ and $\forall\, n_1 < n_2 < \cdots < n_k$, the increments

$$\{x(n_1)\},\ \{x(n_2)-x(n_1)\},\ \ldots,\ \{x(n_k)-x(n_{k-1})\}$$

are jointly independent. For such sequences, the $k$th-order probability function can be constructed as products of
the probability functions of its increments.
• A wide-sense periodic (WSP) process with period $N$ if

$$\mu_x(n) = \mu_x(n+N) \qquad \forall n \tag{2.1.14}$$

and

$$r_x(n_1,n_2) = r_x(n_1+N,n_2) = r_x(n_1,n_2+N) = r_x(n_1+N,n_2+N) \tag{2.1.15}$$

Note that in the above definition, $\mu_x(n)$ is periodic in one dimension while $r_x(n_1,n_2)$ is periodic in two
dimensions.
• A wide-sense cyclostationary process if there exists an integer $N$ such that

$$\mu_x(n) = \mu_x(n+N) \qquad \forall n \tag{2.1.16}$$

and

$$r_x(n_1,n_2) = r_x(n_1+N,n_2+N) \tag{2.1.17}$$

Note that in the above definition, $r_x(n_1,n_2)$ is not periodic in a two-dimensional sense. The correlation sequence
is invariant to shift by $N$ in both of its arguments.
• If all $k$th-order distributions of a stochastic process are jointly Gaussian, then it is called a Gaussian random
sequence.
We can also extend some of these definitions to the case of two joint stochastic processes. The random processes 
x(n) and y(n) are said to be 
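• Statistically independent if for all values of $n_1$ and $n_2$

$$f_{xy}(x,y;n_1,n_2) = f_x(x;n_1)\,f_y(y;n_2) \tag{2.1.18}$$

• Uncorrelated if for every $n_1$ and $n_2$ ($n_1 \neq n_2$)

$$\gamma_{xy}(n_1,n_2) = 0 \quad\text{or}\quad r_{xy}(n_1,n_2) = \mu_x(n_1)\mu_y^*(n_2) \tag{2.1.19}$$

• Orthogonal if for every $n_1$ and $n_2$ ($n_1 \neq n_2$)

$$r_{xy}(n_1,n_2) = 0 \tag{2.1.20}$$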


2.1.3 Stationarity 


A random process x(n) is called stationary if statistics determined for x(n) are equal to those for x(n+k), 
for every k. More specifically, we have the following definition. 


DEFINITION 2.1 (STATIONARY OF ORDER N). A stochastic process $x(n)$ is called stationary of order $N$ if

$$f_x(x_1,\ldots,x_N; n_1,\ldots,n_N) = f_x(x_1,\ldots,x_N; n_1+k,\ldots,n_N+k) \tag{2.1.21}$$

for any value of $k$. If $x(n)$ is stationary for all orders $N = 1, 2, \ldots$, it is said to be strict-sense stationary (SSS).


An IID sequence is SSS. However, SSS is more restrictive than necessary for most practical applications. A
more relaxed form of stationarity, which is sufficient for practical problems, occurs when a random process is
stationary up to order 2, and it is also known as wide-sense stationarity.


DEFINITION 2.2 (WIDE-SENSE STATIONARITY). A random signal $x(n)$ is called wide-sense stationary (WSS) if
1. Its mean is a constant independent of $n$, that is,

$$E\{x(n)\} = \mu_x \tag{2.1.22}$$

2. Its variance is also a constant independent of $n$, that is,

$$\operatorname{var}[x(n)] = \sigma_x^2 \tag{2.1.23}$$

and
3. Its autocorrelation depends only on the distance $l = n_1 - n_2$, called lag, that is,

$$r_x(n_1,n_2) = r_x(n_1-n_2) = r_x(l) = E\{x(n+l)x^*(n)\} = E\{x(n)x^*(n-l)\} \tag{2.1.24}$$

From (2.1.22), (2.1.24), and (2.1.5) it follows that the autocovariance of a WSS signal also depends only on
$l = n_1 - n_2$, that is,

$$\gamma_x(l) = r_x(l) - |\mu_x|^2 \tag{2.1.25}$$


EXAMPLE 2.1.1. Let $w(n)$ be a zero-mean, uncorrelated Gaussian random sequence with variance $\sigma^2(n)=1$.
a. Characterize the random sequence $w(n)$.
b. Define $x(n) = w(n) + w(n-1)$, $-\infty < n < \infty$. Determine the mean and autocorrelation of $x(n)$. Also characterize $x(n)$.


Solution. Note that the variance of $w(n)$ is a constant.


a. Since uncorrelatedness implies independence for Gaussian random variables, $w(n)$ is an independent random sequence. Since
its mean and variance are constants, it is at least stationary in the first order. Furthermore, from (2.1.12) or (2.1.13) we have

$$r_w(n_1,n_2) = \sigma^2\delta(n_1-n_2) = \delta(n_1-n_2)$$

Hence $w(n)$ is also a WSS random process.
b. The mean of $x(n)$ is zero for all $n$ since $w(n)$ is a zero-mean process. Consider

$$\begin{aligned}
r_x(n_1,n_2) &= E\{x(n_1)x^*(n_2)\} \\
&= E\{[w(n_1)+w(n_1-1)][w^*(n_2)+w^*(n_2-1)]\} \\
&= r_w(n_1,n_2) + r_w(n_1,n_2-1) + r_w(n_1-1,n_2) + r_w(n_1-1,n_2-1) \\
&= \sigma^2\delta(n_1-n_2) + \sigma^2\delta(n_1-n_2+1) + \sigma^2\delta(n_1-1-n_2) + \sigma^2\delta(n_1-1-n_2+1) \\
&= 2\delta(n_1-n_2) + \delta(n_1-n_2+1) + \delta(n_1-n_2-1)
\end{aligned}$$

Clearly, $r_x(n_1,n_2)$ is a function of $n_1-n_2$. Hence

$$r_x(l) = 2\delta(l) + \delta(l+1) + \delta(l-1)$$

Therefore, $x(n)$ is a WSS sequence. However, it is not an independent random sequence since both $x(n)$ and $x(n+1)$
depend on $w(n)$.
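
The result of Example 2.1.1 can be checked numerically. The sketch below is only an illustration, not part of the original example: it draws one long realization of unit-variance Gaussian noise, forms $x(n) = w(n) + w(n-1)$, and estimates $r_x(l)$ by time averaging (which assumes the sequence is ergodic in correlation). The helper name `sample_autocorr`, the seed, and the record length are arbitrary choices.

```python
import random

def sample_autocorr(x, lag):
    """Time-average estimate of r_x(lag) = E{x(n) x(n - lag)} for a real sequence."""
    n = len(x)
    return sum(x[i] * x[i - lag] for i in range(lag, n)) / (n - lag)

rng = random.Random(0)           # fixed seed so the run is repeatable
num_samples = 100_000
w = [rng.gauss(0.0, 1.0) for _ in range(num_samples + 1)]   # w(n) ~ N(0, 1)
x = [w[n] + w[n - 1] for n in range(1, num_samples + 1)]    # x(n) = w(n) + w(n-1)

r_hat = [sample_autocorr(x, lag) for lag in range(4)]
print([round(r, 2) for r in r_hat])   # estimates of r_x(0), r_x(1), r_x(2), r_x(3)
```

The estimates should fall near 2, 1, 0, and 0, in agreement with $r_x(l) = 2\delta(l) + \delta(l+1) + \delta(l-1)$, up to the statistical error of a finite record.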


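EXAMPLE 2.1.2 (WIENER PROCESS). Toss a fair coin at each $n$, $-\infty < n < \infty$. Let

$$w(n) = \begin{cases} +S & \text{if heads is the outcome,} \quad \Pr(H) = \tfrac{1}{2} \\ -S & \text{if tails is the outcome,} \quad \Pr(T) = \tfrac{1}{2} \end{cases}$$

where $S$ is a step size. Clearly, $w(n)$ is an independent random process with

$$E\{w(n)\} = 0$$

and

$$E\{w^2(n)\} = \sigma_w^2 = S^2\left(\tfrac{1}{2}\right) + S^2\left(\tfrac{1}{2}\right) = S^2$$

Define a new random process $x(n)$, $n \ge 1$, as

$$\begin{aligned}
x(1) &= w(1) \\
x(2) &= x(1) + w(2) = w(1) + w(2) \\
&\ \,\vdots \\
x(n) &= x(n-1) + w(n) = \sum_{i=1}^{n} w(i)
\end{aligned}$$

Note that $x(n)$ is a running sum of independent steps or increments; thus it is an independent increment process. Such a sequence
is called a discrete Wiener process or random walk. We can easily see that

$$E\{x(n)\} = E\left\{\sum_{i=1}^{n} w(i)\right\} = 0$$

and

$$\begin{aligned}
E\{x^2(n)\} &= E\left\{\sum_{i=1}^{n} w(i) \sum_{k=1}^{n} w(k)\right\} = E\left\{\sum_{i=1}^{n}\sum_{k=1}^{n} w(i)w(k)\right\} \\
&= \sum_{i=1}^{n}\sum_{k=1}^{n} E\{w(i)w(k)\} = \sum_{i=1}^{n} E\{w^2(i)\} = nS^2
\end{aligned}$$

Therefore, random walk is a nonstationary (or evolutionary) process with zero mean and variance that grows with $n$, the number of
steps taken.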
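
The growth of the variance with $n$ can be illustrated by averaging over many independent walks, that is, by an ensemble average at a fixed time. The following is a minimal sketch under assumed values (step size $S = 2$, 20,000 trials, a fixed seed); the function name `random_walk_variance` is hypothetical.

```python
import random

def random_walk_variance(n_steps, step, n_trials, rng):
    """Ensemble estimate of E{x^2(n)} for x(n) = sum of n independent +/-step moves."""
    total = 0.0
    for _ in range(n_trials):
        x = sum(step if rng.random() < 0.5 else -step for _ in range(n_steps))
        total += x * x
    return total / n_trials

rng = random.Random(1)
S = 2.0
estimates = {n: random_walk_variance(n, S, 20_000, rng) for n in (5, 10, 20)}
for n, v in estimates.items():
    print(n, round(v, 1), n * S**2)   # estimate next to the theoretical n * S^2
```

Each estimate should sit close to $nS^2$, so doubling the number of steps roughly doubles the variance, which is the nonstationary behavior described above.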


It should be stressed at this point that although any strict-sense stationary signal is wide-sense stationary, the
converse is not always true, except if the signal is Gaussian.

Two random signals $x(n)$ and $y(n)$ are called jointly wide-sense stationary if each is wide-sense stationary
and their cross-correlation depends only on $l = n_1 - n_2$

$$r_{xy}(l) = E\{x(n)y^*(n-l)\} = \gamma_{xy}(l) + \mu_x\mu_y^* \tag{2.1.26}$$


Note that as a consequence of wide-sense stationarity the two-dimensional correlation and covariance sequences 
become one-dimensional sequences. This is a very important result that ultimately allows for a nice spectral 
description of stationary random processes. 


Properties of autocorrelation sequences 


The autocorrelation sequence of a stationary process has many important properties (which also apply to 
autocovariance sequences, but we will discuss mostly correlation sequences). Vector versions of these properties are
discussed extensively in Section 2.2.4, and their proofs are explored in the problems. 


PROPERTY 2.1.1. The average power of a WSS process $x(n)$ satisfies

$$r_x(0) = \sigma_x^2 + |\mu_x|^2 \ge 0 \tag{2.1.27}$$

and

$$r_x(0) \ge |r_x(l)| \qquad \text{for all } l \tag{2.1.28}$$

Proof. See Problem 2.14 and Property 2.1.6.


This property implies that the correlation attains its maximum value at zero lag and this value is nonnegative.
The quantity $|\mu_x|^2$ is referred to as the average dc power, and the quantity $\sigma_x^2 = \gamma_x(0)$ is referred to as the average
ac power of the random sequence. The quantity $r_x(0)$ then is the total average power of $x(n)$.

PROPERTY 2.1.2. The autocorrelation sequence $r_x(l)$ is a conjugate symmetric function of lag $l$, that is,

$$r_x^*(-l) = r_x(l) \tag{2.1.29}$$

Proof. It follows from Definition 2.2 and from (2.1.24).


PROPERTY 2.1.3. The autocorrelation sequence $r_x(l)$ is nonnegative definite; that is, for any $M > 0$ and any $\alpha_1, \ldots, \alpha_M$,

$$\sum_{k=1}^{M}\sum_{m=1}^{M} \alpha_k\, r_x(k-m)\, \alpha_m^* \ge 0 \tag{2.1.30}$$

This is a necessary and sufficient condition for a sequence $r_x(l)$ to be the autocorrelation sequence of a random sequence.
Proof. See Problem 2.15.
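
To make condition (2.1.30) concrete, the sketch below evaluates the quadratic form for two candidate sequences. It is an illustration under assumptions, not a proof: `r_exp` uses $r(l) = a^{|l|}$ with an assumed $a = 0.8$ (the sequence of Example 2.1.4), and `r_bad` is a made-up sequence that violates $r(0) \ge |r(l)|$. All function names are hypothetical.

```python
import random

def quadratic_form(r, alpha):
    """Evaluate sum_k sum_m alpha_k r(k - m) conj(alpha_m) for complex weights alpha."""
    total = 0j
    for k, a_k in enumerate(alpha):
        for m, a_m in enumerate(alpha):
            total += a_k * r(k - m) * a_m.conjugate()
    return total

def r_exp(lag, a=0.8):
    return a ** abs(lag)          # candidate that is a valid autocorrelation

def r_bad(lag):
    # Made-up candidate with |r(1)| > r(0), so it cannot be an autocorrelation.
    if lag == 0:
        return 1.0
    return 1.5 if abs(lag) == 1 else 0.0

rng = random.Random(2)
values = []
for _ in range(200):
    alpha = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(6)]
    values.append(quadratic_form(r_exp, alpha).real)
print(min(values) >= 0.0)         # no random weight vector drives the form negative

print(quadratic_form(r_bad, [1.0 + 0j, -1.0 + 0j]).real)   # 1 - 1.5 - 1.5 + 1 = -1.0
```

A random search like this can only expose a violation, as it does for `r_bad`; it cannot establish nonnegative definiteness, which requires the condition to hold for every choice of weights.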


Since in this book we exclusively deal with wide-sense stationary processes, we will use the term stationary to 
mean wide-sense stationary. The properties of autocorrelation and cross-correlation sequences of jointly stationary 
processes, x(n) and y(n), are summarized in Table 2.1. 

Although SSS and WSS forms are widely used in practice, there are processes with different forms of 
stationarity. Consider the following example. 


EXAMPLE 2.1.3. Let $x(n)$ be a real-valued random process generated by the system

$$x(n) = a\,x(n-1) + w(n) \qquad n \ge 0 \qquad x(-1) = 0 \tag{2.1.31}$$

where $w(n)$ is a stationary random process with mean $\mu_w$ and $r_w(l) = \sigma_w^2\delta(l)$. The process $x(n)$ generated using (2.1.31) is
known as a first-order autoregressive, or AR(1), process,¹ and the process $w(n)$ is known as a white noise process (defined in
Section 2.1.6). Determine the mean $\mu_x(n)$ of $x(n)$ and comment on its stationarity.


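Solution. To compute the mean of $x(n)$, we express it as a function of $\{w(n), w(n-1), \ldots, w(0)\}$ as follows

$$\begin{aligned}
x(0) &= a\,x(-1) + w(0) = w(0) \\
x(1) &= a\,x(0) + w(1) = a\,w(0) + w(1) \\
&\ \,\vdots \\
x(n) &= a^n w(0) + a^{n-1} w(1) + \cdots + w(n) = \sum_{k=0}^{n} a^k w(n-k)
\end{aligned}$$

Hence the mean of $x(n)$ is given by

$$\mu_x(n) = E\left\{\sum_{k=0}^{n} a^k w(n-k)\right\} = \mu_w \sum_{k=0}^{n} a^k = \begin{cases} \dfrac{1-a^{n+1}}{1-a}\,\mu_w & a \neq 1 \\[2mm] (n+1)\,\mu_w & a = 1 \end{cases}$$

Clearly, the mean of $x(n)$ depends on $n$, and hence it is nonstationary. However, if we assume that $|a| < 1$ (which implies that
the system is BIBO stable), then as $n \to \infty$, we obtain

$$\mu_x(n) = \mu_w\,\frac{1-a^{n+1}}{1-a} \;\longrightarrow\; \frac{\mu_w}{1-a} \qquad \text{as } n \to \infty$$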


Thus $x(n)$ approaches first-order stationarity for large $n$. Similar analysis for the autocorrelation of $x(n)$ shows that $x(n)$
approaches wide-sense stationarity for large $n$ (see Problem 2.23).


¹Note that from (2.1.31), $x(n-1)$ completely determines the distribution for $x(n)$, and $x(n)$ completely determines the distribution
for $x(n+1)$, and so on. If

$$f_{x(n)\mid x(n-1),\ldots}(x_n \mid x_{n-1}, \ldots) = f_{x(n)\mid x(n-1)}(x_n \mid x_{n-1})$$

then the process is termed a Markov process.
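
The mean of the AR(1) process in Example 2.1.3 can also be obtained by taking expectations on both sides of (2.1.31), which gives the recursion $\mu_x(n) = a\,\mu_x(n-1) + \mu_w$ with $\mu_x(-1) = 0$. The sketch below compares that recursion with the closed form for assumed values $a = 0.9$ and $\mu_w = 1$; the function names are hypothetical.

```python
def ar1_mean_recursive(a, mu_w, n_max):
    """Propagate mu_x(n) = a * mu_x(n-1) + mu_w starting from mu_x(-1) = 0."""
    means, mu = [], 0.0
    for _ in range(n_max + 1):
        mu = a * mu + mu_w
        means.append(mu)
    return means

def ar1_mean_closed_form(a, mu_w, n):
    """mu_w * (1 - a^(n+1)) / (1 - a) for a != 1, and (n + 1) * mu_w for a == 1."""
    return (n + 1) * mu_w if a == 1 else mu_w * (1 - a ** (n + 1)) / (1 - a)

a, mu_w = 0.9, 1.0
means = ar1_mean_recursive(a, mu_w, 200)
limit = mu_w / (1 - a)            # steady-state mean, valid for |a| < 1
for n in (0, 1, 10, 50, 200):
    print(n, round(means[n], 4), round(ar1_mean_closed_form(a, mu_w, n), 4))
print("limit:", round(limit, 4))
```

The two columns agree at every $n$, and both settle toward $\mu_w/(1-a)$, which is the asymptotic first-order stationarity noted in the example.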


The above example illustrates a form of stationarity called asymptotic stationarity. A stochastic process $x(n)$
is asymptotically stationary if the statistics of random variables $x(n)$ and $x(n+k)$ become stationary as $k \to \infty$.
When LTI systems are driven by zero-mean uncorrelated-component random processes, the output process becomes
asymptotically stationary in the steady state. Another useful form of stationarity is given by stationary increments. If
the increments $\{x(n) - x(n-k)\}$ of a process $x(n)$ form a stationary process for every $k$, we say that $x(n)$ is a
process with stationary increments. Such processes can be used to model data in various practical applications.

The simplest way to examine in practice whether a real-world signal is stationary is to investigate the physical
mechanism that produces the signal. If this mechanism is time-invariant, then the signal is stationary. In case it is
impossible to draw a conclusion based on physical considerations, we should rely on statistical methods (Bendat and
Piersol 1986; Priestley 1981). Note that stationarity in practice means that a random signal has statistical properties
that do not change over the time interval in which we observe the signal. For evolutionary signals the statistical properties
change continuously with time. Examples of highly nonstationary random signals are the signals associated with the
vibrations induced in space vehicles during launch and reentry. However, there is a kind of random signal whose
statistical properties change slowly with time. Such signals, which are stationary over short periods, are called locally
stationary signals. Many signals of great practical interest, such as speech, EEG, and ECG, belong to this family of
signals.

Finally, we note that general techniques for the analysis of nonstationary signals do not exist. Thus only special 
methods that apply to specific types of nonstationary signals can be developed. Many such methods remove the 
nonstationary component of the signal, leaving behind another component that can be analyzed as stationary (Bendat 
and Piersol 1986; Priestley 1981). 


2.1.4 Ergodicity 


A stochastic process consists of the ensemble and a probability law. If this information is available, the statistical 
properties of the process can be determined in a quite straightforward manner. However, in the real world, we have 
access to only a limited number (usually one) of realizations of the process. The question that arises then is, Can we 
infer the statistical characteristics of the process from a single realization? 

This is possible for the class of random processes that are called ergodic processes. Roughly speaking, 
ergodicity implies that all the statistical information can be obtained from any single representative member of the 
ensemble. 


Time averages 


All the statistical averages that we have defined up to this point are known as ensemble averages because they 
are obtained by “freezing” the time variable and averaging over the ensemble (see Fig. 2.1). Averages of this type are 
formally defined by using the expectation operator E{ }. Ensemble averaging is not used frequently in practice, 
because it is impractical to obtain the number of realizations needed for an accurate estimate. Thus the need for a 
different kind of average, based on only one realization, naturally arises. Obviously such an average can be obtained 
only by time averaging. 

The time average of a quantity, related to a discrete-time random signal, is defined as

$$\langle (\cdot) \rangle \triangleq \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} (\cdot) \tag{2.1.32}$$


Note that, owing to its dependence on a single realization, any time average is itself a random variable. The time 
average is taken over all time because all realizations of a random process exist for all time; that is, they are power 
signals. 

For every ensemble average we can define a corresponding time average. The following time averages are of 
special interest: 
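
$$\begin{aligned}
\text{Mean value} &= \langle x(n) \rangle \\
\text{Mean square} &= \langle |x(n)|^2 \rangle \\
\text{Variance} &= \langle |x(n) - \langle x(n) \rangle|^2 \rangle \\
\text{Autocorrelation} &= \langle x(n)x^*(n-l) \rangle \\
\text{Autocovariance} &= \langle [x(n) - \langle x(n) \rangle][x(n-l) - \langle x(n) \rangle]^* \rangle \\
\text{Cross-correlation} &= \langle x(n)y^*(n-l) \rangle \\
\text{Cross-covariance} &= \langle [x(n) - \langle x(n) \rangle][y(n-l) - \langle y(n) \rangle]^* \rangle
\end{aligned} \tag{2.1.33}$$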


It is necessary to mention at this point the remarkable similarity between time averages and the correlation 
sequences for deterministic power signals. Although this is just a formal similarity, due to the fact that random signals 
are power signals, both quantities have the same properties. However, we should always keep in mind that although 
time averages are random variables (because they are functions of $\zeta$), the corresponding quantities for deterministic
power signals are fixed numbers or deterministic sequences. 


Ergodic random processes 


As we have already mentioned, in many practical applications only one realization of a random signal is 
available instead of the entire ensemble. In general, a single member of the ensemble does not provide information 
about the statistics of the process. However, if the process is stationary and ergodic, then all statistical information can 
be derived from only one typical realization of the process. 

A random signal $x(n)$ is called ergodic¹ if its ensemble averages equal appropriate time averages. There are
several degrees of ergodicity (Papoulis 1991). We will discuss two of them: ergodicity in the mean and ergodicity in
correlation.


DEFINITION 2.3 (ERGODIC IN THE MEAN). A random process $x(n)$ is ergodic in the mean if

$$\langle x(n) \rangle = E\{x(n)\} \tag{2.1.34}$$


DEFINITION 2.4 (ERGODIC IN CORRELATION). A random process $x(n)$ is ergodic in correlation if

$$\langle x(n)x^*(n-l) \rangle = E\{x(n)x^*(n-l)\} \tag{2.1.35}$$


Note that since $\langle x(n) \rangle$ is constant and $\langle x(n)x^*(n-l) \rangle$ is a function of $l$, if $x(n)$ is ergodic in both the
mean and correlation, then it is also WSS. Thus only stationary signals can be ergodic. On the other hand, WSS does 
not imply ergodicity of any kind. Fortunately, in practice almost all stationary processes are also ergodic, which is 
very useful for the estimation of their statistical properties. From now on we will use the term ergodic to mean both 
ergodicity in the mean and ergodicity in correlation. 


DEFINITION 2.5 (JOINT ERGODICITY). Two random signals are called jointly ergodic if they are individually ergodic and in
addition

$$\langle x(n)y^*(n-l) \rangle = E\{x(n)y^*(n-l)\} \tag{2.1.36}$$


A physical interpretation of ergodicity is that one realization of the random signal $x(n)$, as time $n$ tends to
infinity, takes on values with the same statistics as the value $x(n_1)$, corresponding to all samples of the ensemble
members at a given time $n = n_1$.

In practice, it is of course impossible to use the time-average formulas introduced above, because only finite 
records of data are available. In this case, it is common practice to replace the operator (2.1.32) by the operator 


$$\langle (\cdot) \rangle_N \triangleq \frac{1}{2N+1} \sum_{n=-N}^{N} (\cdot) \tag{2.1.37}$$


to obtain estimates of the true quantities. Our desire in such problems is to find estimates that become increasingly 
accurate (in a sense to be defined in Section 2.4) as the length 2N+1 of the record of used data becomes larger. 
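
The finite-record operator (2.1.37) and the role of ergodicity can be illustrated with two simple processes. The sketch below is only an illustration and makes its own assumptions: the first process is IID Gaussian with ensemble mean 1.5, and the second, which is not taken from the text, holds a single random amplitude with $E\{A\} = 0$ for all time. The second process is WSS, yet its time average returns the drawn amplitude rather than the ensemble mean.

```python
import random

def time_average(x):
    """Finite-record time average over the whole record, as in (2.1.37)."""
    return sum(x) / len(x)

rng = random.Random(3)
record_len = 50_001               # plays the role of 2N + 1

# Ergodic-in-the-mean case: IID Gaussian samples with ensemble mean 1.5.
ergodic_means = [
    time_average([rng.gauss(1.5, 1.0) for _ in range(record_len)])
    for _ in range(3)
]

# WSS but not ergodic: each realization is a constant A drawn once, with E{A} = 0.
# (Illustrative process chosen for this sketch.)
stuck_means = []
for _ in range(3):
    amplitude = rng.gauss(0.0, 1.0)
    stuck_means.append(time_average([amplitude] * record_len))

print([round(m, 3) for m in ergodic_means])   # every realization gives nearly the same value
print([round(m, 3) for m in stuck_means])     # each realization gives its own value
```

For the IID process a longer record tightens the estimate around 1.5, whereas for the constant-amplitude process no record length helps, because a single realization carries no information about the spread of $A$ across the ensemble.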


¹Strictly speaking, the form of ergodicity that we will use is called mean-square ergodicity since the underlying convergence of random
variables is in the mean-square sense (Stark and Woods 1994). Therefore, equalities in the definitions are in the mean-square sense.

Finally, to summarize, we note that whereas stationarity ensures the time invariance of the statistics of a random
signal, ergodicity implies that any statistics can be calculated either by averaging over all members of the ensemble at 
a fixed time or by time-averaging over any single representative member of the ensemble. 


2.1.5 Random Signal Variability 


If we consider a stationary random sequence $w(n)$ that is IID with zero mean, its key characteristics depend on its
first-order density. Figure 2.2 shows the probability density functions and sample realizations for IID processes with 
uniform, Gaussian, and Cauchy probability distributions. In the case of the uniform distribution, the amplitude of the 
random variable is limited to a range, with values occurring outside this interval with zero probability. On the other 
hand, the Gaussian distribution does not have a finite interval of support, allowing for the possibility of any value. 
The same is true of the Cauchy distribution, but its characteristics are dramatically different from those of the 
Gaussian distribution. The center lobe of the density is much narrower while the tails that extend out to infinity are 
significantly higher. As a result, the realization of the Cauchy random process contains numerous spikes or extreme 
values while the remainder of the process is more compact about the mean. Although the Gaussian random process 
allows for the possibility of large values, the probability of their occurrence is so small that they are not found in
realizations of the process.

[Figure 2.2: pdf and sample-sequence panels for the uniform, Gaussian, and Cauchy cases; plots not reproduced.]

FIGURE 2.2


Probability density functions and sample realizations of an IID process with uniform, Gaussian, and Cauchy distributions.

The major difference between the Gaussian and Cauchy distributions lies in the area found under the tails of the 
density as it extends out to infinity. This characteristic is related to the variability of the process. The heavy tails, as 
found in the Cauchy distribution, result in an abundance of spikes in the process, a characteristic referred to as high 
variability. On the other hand, a distribution such as the Gaussian does not allow for extreme values and indicates low 
variability. The extent of the variability of a given distribution is determined by the heaviness of the tails. 
Distributions with heavy tails are called long-tailed distributions and have been used extensively as models of 
impulsive random processes. 

DEFINITION 2.6. A distribution is called long-tailed if its tails decay hyperbolically or algebraically as

$$\Pr\{|x(n)| \ge x\} \sim C x^{-\alpha} \qquad \text{as } x \to \infty \tag{2.1.38}$$

where $C$ is a constant and the variable $\alpha$ determines the rate of decay of the distribution.

By way of comparison, the Gaussian distribution has an exponential rate of decay. The implication of the
algebraically decaying tail (when $\alpha < 2$) is that the process has infinite variance, that is,

$$\sigma_x^2 = E\{|x(n)|^2\} = \infty$$

and therefore lacks second-order moments. The lack of second-order moments means that, in addition to the variance,
the correlation functions of these processes do not exist. Since most signal processing algorithms are based on
second-order moment theory, infinite variance has some extreme implications for the way in which such processes
are treated.

In this book, we shall model high variability, and hence infinite variance, using the family of symmetric stable
distributions. The reason is twofold: First, a linear combination of stable random variables is stable. Second, stable
distributions appear as limits in central limit theorems. Stable distributions are characterized by a parameter
$\alpha$, $0 < \alpha \le 2$. They are Cauchy when $\alpha = 1$ and Gaussian when $\alpha = 2$. However, they have finite variance only
when $\alpha = 2$.

In practice, the type of data under consideration governs the variability of the modeling distribution. Random 
signals restricted to a certain interval, such as the phase of complex random signals, are well suited for uniform 
distributions. On the other hand, signals allowing for any possible value but generally confined to a region are better 
suited for Gaussian models. However, if a process contains spikes and therefore has high variability, it is best 
characterized by a long-tailed distribution such as the Cauchy distribution. Impulsive signals have been found in a 
variety of applications, such as communication channels, radar signals, and electronic circuit noise. In all cases, the 
variability of the process dictates the appropriate model. 
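
The difference in variability between Gaussian and Cauchy sequences can be seen by counting how often samples exceed a large threshold. The sketch below is a minimal illustration with assumed settings (standard distributions, threshold 10, 100,000 samples, fixed seed). The helper `cauchy_sample` is hypothetical and draws a standard Cauchy variate by inverting its cdf.

```python
import math
import random

def cauchy_sample(rng):
    """Standard Cauchy variate by inverting its cdf: x = tan(pi * (u - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(4)
n = 100_000
gauss = [rng.gauss(0.0, 1.0) for _ in range(n)]
cauchy = [cauchy_sample(rng) for _ in range(n)]

threshold = 10.0
gauss_tail = sum(abs(v) >= threshold for v in gauss) / n
cauchy_tail = sum(abs(v) >= threshold for v in cauchy) / n

# For the standard Cauchy, Pr{|x| >= t} = 1 - (2/pi) * atan(t), which behaves
# like (2/pi) / t for large t, i.e. an algebraic tail with alpha = 1.
cauchy_exact = 1.0 - (2.0 / math.pi) * math.atan(threshold)
print(gauss_tail, round(cauchy_tail, 4), round(cauchy_exact, 4))
```

Several percent of the Cauchy samples exceed the threshold, while the Gaussian record contains none, which is the spiky versus compact behavior described for the two realizations.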


2.1.6 Frequency-Domain Description of Stationary Processes 


Discrete-time stationary random processes have correlation sequences that are functions of a single index. This leads 
to nice and powerful representations in both the frequency and the z -transform domains. 


Power spectral density 


The power spectral density (PSD, or more appropriately autoPSD) of a stationary stochastic process $x(n)$ is a
Fourier transformation of its autocorrelation sequence $r_x(l)$. If $r_x(l)$ is periodic (which corresponds to a wide-sense
periodic stochastic process) in $l$, then the DTFS can be used to obtain the PSD, which has the form of a line spectrum.
If $r_x(l)$ is nonperiodic, the DTFT can be used, provided that $r_x(l)$ is absolutely summable. This means that the
process $x(n)$ must be a zero-mean process. In general, a stochastic process can be a mixture of periodic and
nonperiodic components.

If we allow impulse functions in the DTFT to represent periodic (or almost periodic) sequences and
non-zero-mean processes, then we can define the PSD as

$$R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)\, e^{-j\omega l} \tag{2.1.39}$$

where $\omega$ is the frequency in radians per sample. If the process $x(n)$ is a zero-mean nonperiodic process, then
(2.1.39) is enough to determine the PSD. If $x(n)$ is periodic (including nonzero mean) or almost periodic, then the
PSD is given by

$$R_x(e^{j\omega}) = \sum_{i} 2\pi A_i\, \delta(\omega - \omega_i) \tag{2.1.40}$$

where the $A_i$ are amplitudes of $r_x(l)$ at frequencies $\omega_i$. For discussion purposes we will assume that $x(n)$ is a
zero-mean nonperiodic process. The autocorrelation $r_x(l)$ can be recovered from the PSD by using the inverse
DTFT as

$$r_x(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{2.1.41}$$


EXAMPLE 2.1.4. Determine the PSD of a zero-mean WSS process $x(n)$ with $r_x(l) = a^{|l|}$, $-1 < a < 1$.
Solution. From (2.1.39) we have

$$\begin{aligned}
R_x(e^{j\omega}) &= \sum_{l=-\infty}^{\infty} a^{|l|} e^{-j\omega l} = \frac{1}{1 - a e^{j\omega}} + \frac{1}{1 - a e^{-j\omega}} - 1 \\
&= \frac{1 - a^2}{1 + a^2 - 2a\cos\omega} \qquad -1 < a < 1
\end{aligned} \tag{2.1.42}$$

which is a real-valued, even, and nonnegative function of $\omega$.


Periodic components are predictable processes as discussed before. However, some nonperiodic components can also be predictable.
Hence nonperiodic components are not always regular processes.
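
The closed form in (2.1.42) can be checked against a truncated version of the defining sum (2.1.39). The sketch below assumes $a = 0.7$ and truncates the sum at $|l| \le 200$, where the neglected terms are negligible; the function names are hypothetical.

```python
import cmath
import math

def psd_truncated_sum(a, omega, max_lag):
    """Partial sum of (2.1.39) for r_x(l) = a^|l| over |l| <= max_lag."""
    return sum(a ** abs(l) * cmath.exp(-1j * omega * l)
               for l in range(-max_lag, max_lag + 1))

def psd_closed_form(a, omega):
    """Closed form (2.1.42): (1 - a^2) / (1 + a^2 - 2 a cos(omega))."""
    return (1 - a * a) / (1 + a * a - 2 * a * math.cos(omega))

a = 0.7
checks = []
for omega in (0.0, 0.5, math.pi / 2, math.pi):
    s = psd_truncated_sum(a, omega, 200)
    checks.append((omega, s, psd_closed_form(a, omega)))
    print(round(omega, 3), round(s.real, 6), round(psd_closed_form(a, omega), 6))
```

The imaginary part of the sum cancels because $a^{|l|}$ is even in $l$, and the real part matches the closed form, which is largest at $\omega = 0$ for positive $a$.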


Properties of the autoPSD. The power spectral density $R_x(e^{j\omega})$ has three key properties that follow from
corresponding properties of the autocorrelation sequence and the DTFT.


PROPERTY 2.1.4. The autoPSD $R_x(e^{j\omega})$ is a real-valued periodic function of frequency with period $2\pi$ for any (real- or
complex-valued) process $x(n)$. If $x(n)$ is real-valued, then $R_x(e^{j\omega})$ is also an even function of $\omega$, that is,

$$R_x(e^{-j\omega}) = R_x(e^{j\omega}) \tag{2.1.43}$$

Proof. It follows from autocorrelation and DTFT properties.


PROPERTY 2.1.5. The autoPSD is nonnegative definite, that is,

$$R_x(e^{j\omega}) \ge 0 \tag{2.1.44}$$

Proof. This follows from the nonnegative definiteness of the autocorrelation sequence [see also discussions leading to (2.2.27)].


PROPERTY 2.1.6. The area under $R_x(e^{j\omega})$ is nonnegative and it equals the average power of $x(n)$. Indeed, from (2.1.41) it
follows with $l = 0$ that

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega})\, d\omega = r_x(0) = E\{|x(n)|^2\} \ge 0 \tag{2.1.45}$$

Proof. It follows from Property 2.1.5.


White noise. A random sequence $w(n)$ is called a (second-order) white noise process with mean $\mu_w$ and
variance $\sigma_w^2$, denoted by

$$w(n) \sim \mathrm{WN}(\mu_w, \sigma_w^2) \tag{2.1.46}$$

if and only if $E\{w(n)\} = \mu_w$ and

$$r_w(l) = E\{w(n)w^*(n-l)\} = \sigma_w^2\,\delta(l) \tag{2.1.47}$$

which implies that

$$R_w(e^{j\omega}) = \sigma_w^2 \qquad -\pi \le \omega \le \pi \tag{2.1.48}$$

The term white noise is used to emphasize that all frequencies contribute the same amount of power, as in the case of
white light, which is obtained by mixing all possible colors by the same amount. If, in addition, the pdf of $w(n)$ is
Gaussian, then the process is called a (second-order) white Gaussian noise process, and it will be denoted by
$\mathrm{WGN}(\mu_w, \sigma_w^2)$.

If the random variables $w(n)$ are independently and identically distributed with mean $\mu_w$ and variance $\sigma_w^2$,
then we shall write

$$w(n) \sim \mathrm{IID}(\mu_w, \sigma_w^2) \tag{2.1.49}$$

This is sometimes referred to as a strict white noise.

We emphasize that the conditions of uncorrelatedness or independence do not put any restriction on the form of
the probability density function of $w(n)$. Thus we can have an IID process with any type of probability distribution.
Clearly, white noise is the simplest random process because it does not have any structure. However, we will see that 
it can be used as the basic building block for the construction of processes with more complicated dependence or 
correlation structures. 


Harmonic processes. A harmonic process is defined by

$$x(n) = \sum_{k=1}^{M} A_k \cos(\omega_k n + \phi_k) \qquad \omega_k \neq 0 \tag{2.1.50}$$

where $M$, $\{A_k\}_1^M$, and $\{\omega_k\}_1^M$ are constants, and $\{\phi_k\}_1^M$ are pairwise independent random variables uniformly
distributed in the interval $[0, 2\pi]$. It can be shown (see Problem 2.9) that $x(n)$ is a stationary process with mean

$$E\{x(n)\} = 0 \qquad \text{for all } n \tag{2.1.51}$$

and autocorrelation

$$r_x(l) = \frac{1}{2} \sum_{k=1}^{M} A_k^2 \cos\omega_k l \qquad -\infty < l < \infty \tag{2.1.52}$$

We note that $r_x(l)$ consists of a sum of “in-phase” cosines with the same frequencies as in $x(n)$.

If $\omega_k/(2\pi)$ are rational numbers, $r_x(l)$ is periodic and can be expanded as a Fourier series. These series
coefficients provide the power spectrum $R_x^{(k)}$ of $x(n)$. However, because $r_x(l)$ is a linear superposition of
cosines, it always has a line spectrum with $2M$ lines of strength $A_k^2/4$ at frequencies $\pm\omega_k$. If $r_x(l)$ is periodic,
then the lines are equidistant (i.e., harmonically related), hence the name harmonic process. If $\omega_k/(2\pi)$ is irrational,
then $r_x(l)$ is almost periodic and can be treated in the frequency domain in almost the same fashion. Hence the
power spectrum of a harmonic process is given by

$$R_x(e^{j\omega}) = \sum_{k=-M}^{M} 2\pi \left(\frac{A_k^2}{4}\right) \delta(\omega - \omega_k) = \sum_{k=-M}^{M} \frac{\pi}{2} A_k^2\, \delta(\omega - \omega_k) \tag{2.1.53}$$

where $A_{-k} = A_k$, $\omega_{-k} = -\omega_k$, and there is no $k = 0$ term, consistent with the $2M$ lines at $\pm\omega_k$.
k=-M k=-M 
EXAMPLE 2.1.5. Consider the following harmonic process

$$x(n) = \cos(0.1\pi n + \phi_1) + 2\sin(1.5 n + \phi_2)$$

where $\phi_1$ and $\phi_2$ are IID random variables uniformly distributed in the interval $[0, 2\pi]$. The first component of $x(n)$ is
periodic with $\omega_1 = 0.1\pi$ and period equal to 20 while the second component is almost periodic with $\omega_2 = 1.5$. Thus the
sequence $x(n)$ is almost periodic. A sample function realization of $x(n)$ is shown in Figure 2.3(a). The mean of $x(n)$ is

$$\mu_x(n) = E\{x(n)\} = E\{\cos(0.1\pi n + \phi_1) + 2\sin(1.5 n + \phi_2)\} = 0$$

and the autocorrelation sequence (using mutual independence between $\phi_1$ and $\phi_2$) is

$$\begin{aligned}
r_x(n_1,n_2) &= E\{x(n_1)x^*(n_2)\} \\
&= E\{\cos(0.1\pi n_1 + \phi_1)\cos(0.1\pi n_2 + \phi_1)\} + E\{2\sin(1.5 n_1 + \phi_2)\, 2\sin(1.5 n_2 + \phi_2)\} \\
&= \tfrac{1}{2}\cos[0.1\pi(n_1 - n_2)] + 2\cos[1.5(n_1 - n_2)]
\end{aligned}$$

or

$$r_x(l) = \tfrac{1}{2}\cos 0.1\pi l + 2\cos 1.5 l \qquad l = n_1 - n_2$$

Thus the line spectrum $R_x^{(k)}$ is given by

$$R_x^{(k)} = \begin{cases} 1 & \omega = -1.5 \\ \tfrac{1}{4} & \omega = -0.1\pi \\ \tfrac{1}{4} & \omega = 0.1\pi \\ 1 & \omega = 1.5 \end{cases}$$

and the power spectrum $R_x(e^{j\omega})$ is given by

$$R_x(e^{j\omega}) = 2\pi\delta(\omega + 1.5) + \frac{\pi}{2}\delta(\omega + 0.1\pi) + \frac{\pi}{2}\delta(\omega - 0.1\pi) + 2\pi\delta(\omega - 1.5)$$

The line spectrum of $x(n)$ is shown in Figure 2.3(b) and the corresponding power spectrum in Figure 2.3(c).
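
The autocorrelation found in Example 2.1.5 can be checked by an ensemble average over the random phases. The sketch below is an illustration under assumed settings (40,000 phase draws per estimate, fixed seed, hypothetical function names). It also evaluates two time pairs with the same lag to show that only $n_1 - n_2$ matters.

```python
import math
import random

def harmonic_sample(n, phi1, phi2):
    """x(n) = cos(0.1*pi*n + phi1) + 2*sin(1.5*n + phi2) from Example 2.1.5."""
    return math.cos(0.1 * math.pi * n + phi1) + 2.0 * math.sin(1.5 * n + phi2)

def ensemble_autocorr(n1, n2, n_trials, rng):
    """Average x(n1) * x(n2) over independent draws of the two phases."""
    total = 0.0
    for _ in range(n_trials):
        phi1 = rng.uniform(0.0, 2.0 * math.pi)
        phi2 = rng.uniform(0.0, 2.0 * math.pi)
        total += harmonic_sample(n1, phi1, phi2) * harmonic_sample(n2, phi1, phi2)
    return total / n_trials

def r_theory(lag):
    """r_x(l) = 0.5*cos(0.1*pi*l) + 2*cos(1.5*l) from the example."""
    return 0.5 * math.cos(0.1 * math.pi * lag) + 2.0 * math.cos(1.5 * lag)

rng = random.Random(5)
results = []
for n1, n2 in ((3, 3), (7, 4), (107, 104), (12, 2)):
    est = ensemble_autocorr(n1, n2, 40_000, rng)
    results.append((n1, n2, est, r_theory(n1 - n2)))
    print(n1, n2, round(est, 3), round(r_theory(n1 - n2), 3))
```

The pairs (7, 4) and (107, 104) share the lag $l = 3$ and should give nearly the same estimate, which reflects the wide-sense stationarity of the process.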


The harmonic process is predictable because any given realization is a sinusoidal sequence with fixed amplitude, 
frequency, and phase. We stress that the independence of the phases is required to guarantee the stationarity of x(n) 
in (2.1.50). The uniform distribution of the phases is necessary to make x(n) a stationary process (see Problem 2.9). 
The harmonic process (2.1.50), in general, is non-Gaussian; however, it becomes Gaussian if the amplitudes A, are 
random variables with a Rayleigh distribution (Porat 1994). 


EXAMPLE 2.1.6. Consider a complex-valued process given by

$$x(n) = A e^{j\omega_0 n} = |A|\, e^{j(\omega_0 n + \phi)}$$

where $A$ is a complex-valued random variable and $\omega_0$ is constant. The mean of $x(n)$

$$E\{x(n)\} = E\{A\}\, e^{j\omega_0 n}$$

can be constant only if $E\{A\} = 0$. If $|A|$ is constant and $\phi$ is uniformly distributed on $[0, 2\pi]$, then we have
$E\{A\} = |A|\, E\{e^{j\phi}\} = 0$. In this case the autocorrelation is

$$r_x(n_1,n_2) = E\{A e^{j\omega_0 n_1} A^* e^{-j\omega_0 n_2}\} = |A|^2 e^{j(n_1 - n_2)\omega_0} = |A|^2 e^{j\omega_0 l}$$

Since the mean is constant and the autocorrelation depends on the difference $l = n_1 - n_2$, the process is wide-sense stationary.





FIGURE 2.3


The time and frequency-domain description of the harmonic process in Example 2.1.5. 


The above example can be generalized to harmonic processes of the form

$$x(n) = \sum_{k=1}^{M} A_k\, e^{j(\omega_k n + \phi_k)} \tag{2.1.54}$$

where $M$, $\{A_k\}_1^M$, and $\{\omega_k\}_1^M$ are constants and $\{\phi_k\}_1^M$ are pairwise independent random variables uniformly
distributed in the interval $[0, 2\pi]$. The autocorrelation sequence is

$$r_x(l) = \sum_{k=1}^{M} |A_k|^2 e^{j\omega_k l} \tag{2.1.55}$$

and the power spectrum consists of $M$ impulses with amplitudes $2\pi |A_k|^2$ at frequencies $\omega_k$. If the amplitudes
$\{A_k\}_1^M$ are random variables, mutually independent of the random phases, the quantity $|A_k|^2$ is replaced by
$E\{|A_k|^2\}$.


Cross-power spectral density


The cross-power spectral density of two zero-mean and jointly stationary stochastic processes provides a
description of their statistical relations in the frequency domain and is defined as the DTFT of their cross-correlation,
that is,

$$R_{xy}(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_{xy}(l)\, e^{-j\omega l} \tag{2.1.56}$$

The cross-correlation $r_{xy}(l)$ can be recovered by the inverse DTFT

$$r_{xy}(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_{xy}(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{2.1.57}$$

The cross-spectrum $R_{xy}(e^{j\omega})$ is, in general, a complex function of $\omega$. From $r_{xy}(l) = r_{yx}^*(-l)$ it follows that

$$R_{xy}(e^{j\omega}) = R_{yx}^*(e^{j\omega}) \tag{2.1.58}$$

This implies that $R_{xy}(e^{j\omega})$ and $R_{yx}(e^{j\omega})$ have the same magnitude but opposite phase.
The normalized cross-spectrum

$$\mathcal{G}_{xy}(e^{j\omega}) \triangleq \frac{R_{xy}(e^{j\omega})}{\sqrt{R_x(e^{j\omega})\, R_y(e^{j\omega})}} \tag{2.1.59}$$

is called the coherence function. Its squared magnitude

$$|\mathcal{G}_{xy}(e^{j\omega})|^2 = \frac{|R_{xy}(e^{j\omega})|^2}{R_x(e^{j\omega})\, R_y(e^{j\omega})} \tag{2.1.60}$$

is known as the magnitude square coherence (MSC) and can be thought of as a sort of correlation coefficient in the
frequency domain. If $x(n) = y(n)$, then $\mathcal{G}_{xy}(e^{j\omega}) = 1$ (maximum correlation) whereas if $x(n)$ and $y(n)$ are
uncorrelated, then $r_{xy}(l) = 0$, so that $R_{xy}(e^{j\omega}) = 0$ and hence $\mathcal{G}_{xy}(e^{j\omega}) = 0$. In other words, $0 \le |\mathcal{G}_{xy}(e^{j\omega})| \le 1$.


Complex spectral density functions 


If the sequences $r_x(l)$ and $r_{xy}(l)$ are absolutely summable within a certain ring of the complex $z$ plane, we
can obtain their $z$-transforms

$$R_x(z) = \sum_{l=-\infty}^{\infty} r_x(l) z^{-l} \tag{2.1.61}$$

$$R_{xy}(z) = \sum_{l=-\infty}^{\infty} r_{xy}(l) z^{-l} \tag{2.1.62}$$

which are known as the complex spectral density and complex cross-spectral density functions, respectively. If the
unit circle, defined by $z = e^{j\omega}$, is within the region of convergence of the above summations, then

$$R_x(e^{j\omega}) = R_x(z) \big|_{z = e^{j\omega}} \tag{2.1.63}$$

$$R_{xy}(e^{j\omega}) = R_{xy}(z) \big|_{z = e^{j\omega}} \tag{2.1.64}$$


The correlation and power spectral density properties of random sequences are summarized in Table 2.1. 
EXAMPLE 2.1.7. Consider the random sequence given in Example 2.1.4 with autoPSD in (2.1.42)

$$R_x(e^{j\omega}) = \frac{1 - a^2}{1 + a^2 - 2a \cos\omega} \qquad |a| < 1$$

Determine the complex autoPSD $R_x(z)$.
Solution. The complex autoPSD is given by $R_x(z) = R_x(e^{j\omega}) \big|_{e^{j\omega} = z}$. Since

$$\cos\omega = \frac{e^{j\omega} + e^{-j\omega}}{2} = \frac{z + z^{-1}}{2} \bigg|_{z = e^{j\omega}}$$

we obtain

$$R_x(z) = \frac{1 - a^2}{1 + a^2 - a(z + z^{-1})} = \frac{(a - a^{-1}) z^{-1}}{1 - (a + a^{-1}) z^{-1} + z^{-2}} \qquad |a| < |z| < \frac{1}{|a|}$$

Now the inverse $z$-transform of $R_x(z)$ determines the autocorrelation sequence $r_x(l)$, that is,

$$R_x(z) = \frac{(a - a^{-1}) z^{-1}}{(1 - a z^{-1})(1 - a^{-1} z^{-1})} = \frac{1}{1 - a z^{-1}} - \frac{1}{1 - a^{-1} z^{-1}} \qquad |a| < |z| < \frac{1}{|a|}$$

or

$$r_x(l) = a^l u(l) + (a^{-1})^l u(-l-1) = a^{|l|} \tag{2.1.65}$$

This approach can be used to determine autocorrelation sequences from autoPSD functions.
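
The pair formed by (2.1.65) and the autoPSD of Example 2.1.7 can be verified numerically. The sketch below, in Python with NumPy, is an illustration rather than part of the text: the function names, the value of $a$, and the truncation lag are assumptions. It evaluates the DTFT sum of $r_x(l) = a^{|l|}$ over a finite range of lags and compares it with the closed-form expression.

```python
import numpy as np

def psd_closed_form(a, omega):
    """AutoPSD of Example 2.1.7: (1 - a^2) / (1 + a^2 - 2 a cos w)."""
    return (1.0 - a**2) / (1.0 + a**2 - 2.0 * a * np.cos(omega))

def psd_from_autocorr(a, omega, max_lag=200):
    """Truncated DTFT of r_x(l) = a^{|l|}, summed over |l| <= max_lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    r = a ** np.abs(lags)                      # eq. (2.1.65)
    # Each row of the exponential matrix corresponds to one frequency
    return (np.exp(-1j * np.outer(omega, lags)) @ r).real

a = 0.6                                        # hypothetical value with |a| < 1
omega = np.linspace(-np.pi, np.pi, 9)
error = np.max(np.abs(psd_from_autocorr(a, omega) - psd_closed_form(a, omega)))
print(error)
```

Because $|a| < 1$, the neglected tail of the sum decays geometrically, so a moderate truncation lag is enough for the two expressions to agree to within rounding error.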


Table 2.1 provides a summary of correlation and spectral properties of stationary random sequences. 


Table 2.1
Summary of correlation and spectral properties of stationary random sequences.

Definitions

  Mean value                      $\mu_x = E\{x(n)\}$
  Autocorrelation                 $r_x(l) = E\{x(n) x^*(n-l)\}$
  Autocovariance                  $\gamma_x(l) = E\{[x(n) - \mu_x][x(n-l) - \mu_x]^*\}$
  Cross-correlation               $r_{xy}(l) = E\{x(n) y^*(n-l)\}$
  Cross-covariance                $\gamma_{xy}(l) = E\{[x(n) - \mu_x][y(n-l) - \mu_y]^*\}$
  Power spectral density          $R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l) e^{-j\omega l}$
  Cross-power spectral density    $R_{xy}(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_{xy}(l) e^{-j\omega l}$
  Magnitude square coherence      $|\mathcal{G}_{xy}(e^{j\omega})|^2 = |R_{xy}(e^{j\omega})|^2 / [R_x(e^{j\omega}) R_y(e^{j\omega})]$

Interrelations

  $\gamma_x(l) = r_x(l) - |\mu_x|^2$
  $\gamma_{xy}(l) = r_{xy}(l) - \mu_x \mu_y^*$

Properties

  Autocorrelation
    $r_x(l)$ is nonnegative definite
    $r_x(l) = r_x^*(-l)$
    $|r_x(l)| \le r_x(0)$
    $|\rho_x(l)| \le 1$

  Auto-PSD
    $R_x(e^{j\omega}) \ge 0$ and real
    $R_x(e^{j\omega}) = R_x(e^{-j\omega})$ [real $x(n)$]
    $R_x(z) = R_x^*(1/z^*)$
    $R_x(z) = R_x(z^{-1})$ [real $x(n)$]

  Cross-correlation
    $r_{xy}(l) = r_{yx}^*(-l)$
    $|r_{xy}(l)| \le [r_x(0) r_y(0)]^{1/2} \le [r_x(0) + r_y(0)]/2$
    $|\rho_{xy}(l)| \le 1$

  Cross-PSD
    $R_{xy}(z) = R_{yx}^*(1/z^*)$
    $0 \le |\mathcal{G}_{xy}(e^{j\omega})| \le 1$


2.2 Linear Systems with Stationary Random Inputs


This section deals with the processing of stationary random sequences using linear, time-invariant (LTI) systems. We 
focus on expressing the second-order statistical properties of the output in terms of the corresponding properties of 
the input and the characteristics of the system. 


2.2.1 Time-Domain Analysis 


The first question to ask when we apply a random signal to a system is, just what is the meaning of such an operation? 
We ask this because a random process is not just a single sequence but an ensemble of sequences (see Section 2.1). 
However, since each realization of the stochastic process is a deterministic signal, it is an acceptable input producing 
an output that is clearly a single realization of the output stochastic process. For an LTI system, each pair of 
input-output realizations is described by the convolution summation 


$$y(n, \zeta) = \sum_{k=-\infty}^{\infty} h(k) x(n-k, \zeta) \tag{2.2.1}$$

If the sum on the right side of (2.2.1) exists for all $\zeta$ such that $\Pr\{\zeta\} = 1$, then we say that we have
almost-everywhere convergence or convergence with probability 1 (Papoulis 1991). The existence of such
convergence is governed by the following theorem (Brockwell and Davis 1991).


THEOREM 2.1. If the process $x(n, \zeta)$ is stationary with $E\{|x(n, \zeta)|\} < \infty$ and if the system is BIBO-stable, that is,
$\sum_{k=-\infty}^{\infty} |h(k)| < \infty$, then the output $y(n, \zeta)$ of the system in (2.2.1) converges absolutely with probability 1, or

$$y(n, \zeta) = \sum_{k=-\infty}^{\infty} h(k) x(n-k, \zeta) \qquad \text{for all } \zeta \in A, \ \Pr\{A\} = 1 \tag{2.2.2}$$

and is stationary. Furthermore, if $E\{|x(n, \zeta)|^2\} < \infty$, then $E\{|y(n, \zeta)|^2\} < \infty$ and $y(n, \zeta)$ converges in the mean square to
the same limit and is stationary.


A less restrictive condition of finite power on the system impulse response h(n) also guarantees the mean 
square existence of the output process, as stated in the following theorem. 


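THEOREM 2.2. If the process $x(n, \zeta)$ is zero-mean and stationary with $\sum_{l=-\infty}^{\infty} |r_x(l)| < \infty$ and if the system (2.2.1) satisfies the
condition

$$\sum_{n=-\infty}^{\infty} |h(n)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(e^{j\omega})|^2 \, d\omega < \infty \tag{2.2.3}$$

then the output $y(n, \zeta)$ converges in the mean square sense and is stationary.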


The above two theorems are applicable when input processes have finite variances. However, IID sequences
with $\alpha$-stable distributions have infinite variances. If the impulse response of the system in (2.2.1) decays fast
enough, then the following theorem (Brockwell and Davis 1991) guarantees the absolute convergence of $y(n, \zeta)$
with probability 1. These issues are of particular importance for inputs with high variability and are discussed in
Section 2.1.5.


THEOREM 2.3. Let $x(n, \zeta)$ be an IID sequence of random variables with $\alpha$-stable distribution, $0 < \alpha < 2$. If the impulse
response $h(n)$ satisfies

$$\sum_{n=-\infty}^{\infty} |h(n)|^{\delta} < \infty \qquad \text{for some } \delta \in (0, \alpha)$$

then the output $y(n, \zeta)$ in (2.2.1) converges absolutely with probability 1.


Clearly, a complete description of the output stochastic process y(n) requires the computation of an infinite 
number of convolutions. Thus, a better alternative would be to determine the statistical properties of y(n) in terms 
of the statistical properties of the input and the characteristics of the system. For Gaussian signals, which are used 
very often in practice, first- and second-order statistics are sufficient.

Output mean value. If $x(n)$ is stationary, its first-order statistic is determined by its mean value $\mu_x$. To
determine the mean value of the output, we take the expected value of both sides of (2.2.1):

$$\mu_y = \sum_{k=-\infty}^{\infty} h(k) E\{x(n-k)\} = \mu_x \sum_{k=-\infty}^{\infty} h(k) = \mu_x H(e^{j0}) \tag{2.2.4}$$

Since $\mu_x$ and $H(e^{j0})$ are constant, $\mu_y$ is also constant. Note that $H(e^{j0})$ is the dc gain of the system.


Input-output cross-correlation. If we take the complex conjugate of (2.2.1), premultiply it by $x(n+l)$, and take
the expectation of both sides, we have

$$E\{x(n+l) y^*(n)\} = \sum_{k=-\infty}^{\infty} h^*(k) E\{x(n+l) x^*(n-k)\}$$

or

$$r_{xy}(l) = \sum_{k=-\infty}^{\infty} h^*(k) r_x(l+k) = \sum_{m=-\infty}^{\infty} h^*(-m) r_x(l-m)$$

Hence,

$$r_{xy}(l) = h^*(-l) * r_x(l) \tag{2.2.5}$$

Similarly,

$$r_{yx}(l) = h(l) * r_x(l) \tag{2.2.6}$$


Output autocorrelation. Postmultiplying both sides of (2.2.1) by $y^*(n-l)$ and taking the expectation, we
obtain

$$E\{y(n) y^*(n-l)\} = \sum_{k=-\infty}^{\infty} h(k) E\{x(n-k) y^*(n-l)\} \tag{2.2.7}$$

or

$$r_y(l) = \sum_{k=-\infty}^{\infty} h(k) r_{xy}(l-k) = h(l) * r_{xy}(l) \tag{2.2.8}$$

From (2.2.5) and (2.2.8) we get

$$r_y(l) = h(l) * h^*(-l) * r_x(l) \tag{2.2.9}$$

or

$$r_y(l) = r_h(l) * r_x(l) \tag{2.2.10}$$

where

$$r_h(l) \triangleq h(l) * h^*(-l) = \sum_{n=-\infty}^{\infty} h(n) h^*(n-l) \tag{2.2.11}$$

is the autocorrelation of the impulse response and is called the system correlation sequence.

Since $\mu_y$ is constant and $r_y(l)$ depends only on the lag $l$, the response of a stable system to a stationary
input is also a stationary process. A careful examination of (2.2.10) shows that when a signal $x(n)$ is filtered by an
LTI system with impulse response $h(n)$, its autocorrelation is "filtered" by a system with impulse response equal to
the autocorrelation of its impulse response, as shown in Figure 2.4.


FIGURE 2.4
An equivalent LTI system for autocorrelation filtering.


Output power. The power $E\{|y(n)|^2\}$ of the output process $y(n)$ is equal to $r_y(0)$, which from (2.2.9) and
(2.2.10) and the symmetry property of $r_x(l)$ is

$$P_y = r_y(0) = r_h(l) * r_x(l) \big|_{l=0} = \sum_{k=-\infty}^{\infty} r_h(k) r_x(-k) = \sum_{k=-\infty}^{\infty} [h(k) * h^*(-k)] r_x(k)$$

$$= \sum_{k=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} h(m) h^*(m-k) r_x(k) \tag{2.2.12}$$

$$= \sum_{k=-\infty}^{\infty} r_h(k) r_x(k) \tag{2.2.13}$$

or for FIR filters with $\mathbf{h} = [h(0) \ h(1) \ \cdots \ h(M-1)]^T$, (2.2.12) can be written as

$$P_y = \mathbf{h}^H \mathbf{R}_x \mathbf{h} \tag{2.2.14}$$

Finally, we note that when $\mu_x = 0$, we have $\mu_y = 0$ and $\sigma_y^2 = P_y$.
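
The relations (2.2.10), (2.2.13), and (2.2.14) can be compared on a small case. The sketch below uses Python with NumPy and is only an illustration: the impulse response `h`, the input model $r_x(l) = a^{|l|}$, and the number of retained lags `L` are assumptions, and everything is kept real-valued so that conjugates play no role. It computes the output power in three ways: from the convolution $r_h(l) * r_x(l)$ at lag zero, from the lag-domain sum, and from the quadratic form with the Toeplitz matrix $\mathbf{R}_x$.

```python
import numpy as np

h = np.array([1.0, -0.5, 0.25])       # hypothetical real FIR impulse response
a = 0.8                               # hypothetical input model r_x(l) = a^{|l|}
M = h.size
L = 10                                # lags kept for r_x; needs L >= M - 1

lags_x = np.arange(-L, L + 1)
r_x = a ** np.abs(lags_x)

# System correlation sequence (2.2.11) for lags -(M-1), ..., M-1
r_h = np.correlate(h, h, mode="full")

# (2.2.10): r_y = r_h * r_x; values near the ends suffer from truncating r_x
r_y = np.convolve(r_h, r_x)
lags_y = np.arange(-(L + M - 1), L + M)
P_conv = r_y[lags_y == 0][0]

# (2.2.13): sum over the lags where r_h is nonzero
k = np.arange(-(M - 1), M)
P_sum = np.sum(r_h * a ** np.abs(k))

# (2.2.14): quadratic form with R_x[i, j] = r_x(j - i)
idx = np.arange(M)
R_x = a ** np.abs(idx[None, :] - idx[:, None])
P_quad = h @ R_x @ h

print(P_conv, P_sum, P_quad)
```

The three numbers coincide because the lag-zero output only involves input lags up to $M - 1$, which the truncated $r_x(l)$ still contains.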


Output probability density function. Finding the probability density of the output of an LTI system is very
difficult, except in some special cases. For example, if $x(n)$ is a Gaussian process, then the output is also a Gaussian
process with mean and autocorrelation given by (2.2.4) and (2.2.10). Also, if $x(n)$ is IID, the probability density of
the output is obtained by noting that $y(n)$ is a weighted sum of independent random variables. Indeed, the
probability density of the sum of independent random variables is the convolution of their probability densities, and its
characteristic function is the product of their characteristic functions. Thus, if the input process is an IID stable process,
then the output process is also stable, and its probability density can be computed by using characteristic functions.


2.2.2 Frequency-Domain Analysis 
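To obtain the output autoPSD and complex autoPSD, we recall that if $H(z) = \mathcal{Z}\{h(n)\}$, then

$$\mathcal{Z}\{h^*(-n)\} = H^*\!\left(\frac{1}{z^*}\right) \tag{2.2.15}$$

From (2.2.5), (2.2.6), and (2.2.9) we obtain

$$R_{xy}(z) = H^*\!\left(\frac{1}{z^*}\right) R_x(z) \tag{2.2.16}$$

$$R_{yx}(z) = H(z) R_x(z) \tag{2.2.17}$$

and

$$R_y(z) = H(z) H^*\!\left(\frac{1}{z^*}\right) R_x(z) \tag{2.2.18}$$

For a stable system, the unit circle $z = e^{j\omega}$ lies within the ROCs of $H(z)$ and $H(z^{-1})$. Thus,

$$R_{xy}(e^{j\omega}) = H^*(e^{j\omega}) R_x(e^{j\omega}) \tag{2.2.19}$$

$$R_{yx}(e^{j\omega}) = H(e^{j\omega}) R_x(e^{j\omega}) \tag{2.2.20}$$

and

$$R_y(e^{j\omega}) = H(e^{j\omega}) H^*(e^{j\omega}) R_x(e^{j\omega}) \tag{2.2.21}$$

or

$$R_y(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega}) \tag{2.2.22}$$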


Thus, if we know the input and output autocorrelations or autospectral densities, we can determine the magnitude 
response of a system, but not its phase response. Only cross-correlation or cross-spectral densities can provide phase 
information [see (2.2.19) and (2.2.20)]. 

It can easily be shown that the power of the output is

$$E\{|y(n)|^2\} = r_y(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(e^{j\omega})|^2 R_x(e^{j\omega}) \, d\omega \tag{2.2.23}$$

$$= \sum_{l=-\infty}^{\infty} r_x(l) r_h(l) \tag{2.2.24}$$

which is equivalent to (2.2.13).
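Consider now a narrowband filter with frequency response

$$H(e^{j\omega}) = \begin{cases} 1 & \omega_c - \frac{\Delta\omega}{2} \le \omega \le \omega_c + \frac{\Delta\omega}{2} \\ 0 & \text{elsewhere} \end{cases} \tag{2.2.25}$$

The power of the filter output is

$$E\{|y(n)|^2\} = \frac{1}{2\pi} \int_{\omega_c - \Delta\omega/2}^{\omega_c + \Delta\omega/2} R_x(e^{j\omega}) \, d\omega \simeq \frac{\Delta\omega}{2\pi} R_x(e^{j\omega_c}) \tag{2.2.26}$$

assuming that $\Delta\omega$ is sufficiently small and that $R_x(e^{j\omega})$ is continuous at $\omega = \omega_c$. Since $E\{|y(n)|^2\} \ge 0$,
$R_x(e^{j\omega_c})$ is also nonnegative for all $\omega_c$ and $\Delta\omega$, hence

$$R_x(e^{j\omega}) \ge 0 \qquad -\pi \le \omega \le \pi \tag{2.2.27}$$

Hence, the PSD $R_x(e^{j\omega})$ is nonnegative for any random sequence $x(n)$, real or complex. Furthermore,
$R_x(e^{j\omega}) \, d\omega / (2\pi)$ has the interpretation of power, that is, $R_x(e^{j\omega})$ is a power density as a function of the frequency $\omega$ (in
radians per sample). Table 2.2 shows various input-output relationships in both the time and frequency domains.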

Table 2.2
Second-order moments of stationary random sequences processed by linear, time-invariant systems.

Time domain                              Frequency domain                                            z domain
$y(n) = h(n) * x(n)$                     Not available                                               Not available
$r_{yx}(l) = h(l) * r_x(l)$              $R_{yx}(e^{j\omega}) = H(e^{j\omega}) R_x(e^{j\omega})$     $R_{yx}(z) = H(z) R_x(z)$
$r_{xy}(l) = h^*(-l) * r_x(l)$           $R_{xy}(e^{j\omega}) = H^*(e^{j\omega}) R_x(e^{j\omega})$   $R_{xy}(z) = H^*(1/z^*) R_x(z)$
$r_y(l) = h(l) * r_{xy}(l)$              $R_y(e^{j\omega}) = H(e^{j\omega}) R_{xy}(e^{j\omega})$     $R_y(z) = H(z) R_{xy}(z)$
$r_y(l) = h(l) * h^*(-l) * r_x(l)$       $R_y(e^{j\omega}) = |H(e^{j\omega})|^2 R_x(e^{j\omega})$    $R_y(z) = H(z) H^*(1/z^*) R_x(z)$


2.2.3 Random Signal Memory 


Given the "zero-memory" process $w(n) \sim \mathrm{IID}(0, \sigma_w^2)$, we can introduce dependence by passing it through an LTI
system. The extent and degree of the imposed dependence are dictated by the shape of the system's impulse response.
The probability density of $w(n)$ is not explicitly involved. Suppose now that we are given the resulting linear
process $x(n)$, and we want to quantify its memory. For processes with finite variance we can use the correlation
length

$$L_c = \frac{1}{r_x(0)} \sum_{l=0}^{\infty} r_x(l) = \sum_{l=0}^{\infty} \rho_x(l)$$

which equals the area under the normalized autocorrelation sequence curve and shows the maximum distance at
which two samples are significantly correlated.

An IID process has no memory and is completely described by its first-order density. A linear process has
memory introduced by the impulse response of the generating system. If $w(n)$ has finite variance, the memory of
the process is determined by the autocorrelation of the impulse response because $r_x(l) = \sigma_w^2 r_h(l)$. Also, the
higher-order densities of the process are nonzero. Thus, the variability of the output (that is, what amplitudes the
signal takes, how often, and how fast the amplitude changes from sample to sample) is the combined effect of the
input probability density and the system memory.


DEFINITION 2.7. A stationary process $x(n)$ with finite variance is said to have long memory if there exist constants
$\alpha$, $0 < \alpha < 1$, and $C_r > 0$ such that

$$\lim_{l \to \infty} \frac{1}{C_r \sigma_x^2} r_x(l) \, l^{\alpha} = 1$$

This implies that the autocorrelation has fat or heavy tails, that is, it asymptotically decays as a power law

$$\rho_x(l) \simeq C_r |l|^{-\alpha} \qquad \text{as } l \to \infty$$

and slowly enough that

$$\sum_{l=-\infty}^{\infty} \rho_x(l) = \infty$$

that is, a long-memory process has infinite correlation length. If

$$\sum_{l=-\infty}^{\infty} |\rho_x(l)| < \infty$$

we say that the process has short memory. This is the case for autocorrelations that decay exponentially, for
example, $\rho_x(l) = a^{|l|}$, $-1 < a < 1$.
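
The difference between the two cases shows up clearly in the partial sums of the normalized autocorrelation. The following Python/NumPy sketch is only illustrative: the function names and the values of `a` and `alpha` are assumptions, and the power-law sequence is used solely to show the divergence of the sum, without claiming that it is a valid (nonnegative definite) autocorrelation sequence. For the exponential decay the truncated correlation length settles at $1/(1-a)$, whereas for the power-law tail it keeps growing with the truncation lag.

```python
import numpy as np

def correlation_length(rho, max_lag):
    """Partial sum of rho(l) for l = 0, ..., max_lag (truncated L_c)."""
    lags = np.arange(max_lag + 1)
    return float(np.sum(rho(lags)))

a, alpha = 0.9, 0.5        # hypothetical decay parameters

def rho_short(l):
    """Exponentially decaying normalized autocorrelation a^{|l|}."""
    return a ** np.abs(l)

def rho_long(l):
    """Power-law tail |l|^(-alpha), with the value at lag 0 set to 1."""
    l = np.abs(l).astype(float)
    return np.where(l == 0, 1.0, np.maximum(l, 1.0) ** (-alpha))

for max_lag in (10**2, 10**4, 10**6):
    print(max_lag,
          correlation_length(rho_short, max_lag),
          correlation_length(rho_long, max_lag))
```

With these settings the short-memory column stabilizes after a few hundred lags, while the long-memory column grows roughly like the square root of the truncation lag.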
An equivalent definition of long memory can be formulated in terms of the power spectrum (Beran 1994; 
Samorodnitsky and Taqqu 1994). 


DEFINITION 2.8. A stationary process $x(n)$ with finite variance is said to have long memory if there exist constants
$\beta$, $0 < \beta < 1$, and $C_R > 0$ such that

$$\lim_{\omega \to 0} \frac{1}{C_R \sigma_x^2} R_x(e^{j\omega}) |\omega|^{\beta} = 1$$

This asymptotic definition implies that

$$R_x(e^{j\omega}) \simeq \frac{C_R \sigma_x^2}{|\omega|^{\beta}} \qquad \text{as } \omega \to 0$$

and

$$R_x(e^{j0}) = \sum_{l=-\infty}^{\infty} r_x(l) = \infty$$

The first-order density determines the mean value and the variance of a process, whereas the second-order
density determines the autocorrelation and power spectrum. There is a coupling between the probability density and
the autocorrelation or power spectrum of a process. However, this coupling is not extremely strong because there are
processes that have different densities and the same autocorrelation. Thus, we can have random signal models with
short or long memory and low or high variability. Random signal models are discussed in Chapter 3.


2.2.4 General Correlation Matrices 


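We begin with the properties of general correlation matrices. Similar properties apply to covariance matrices.


PROPERTY 2.2.1. The correlation matrix of a random vector $\mathbf{x}$ is conjugate symmetric or Hermitian, that is,

$$\mathbf{R}_x = \mathbf{R}_x^H \tag{2.2.28}$$

Proof. This follows easily from (2.2.19).

PROPERTY 2.2.2. The correlation matrix of a random vector $\mathbf{x}$ is nonnegative definite (n.n.d.); or for every nonzero complex
vector $\mathbf{w} = [w_1 \ w_2 \ \cdots \ w_M]^T$, the quadratic form $\mathbf{w}^H \mathbf{R}_x \mathbf{w}$ is nonnegative, that is,

$$\mathbf{w}^H \mathbf{R}_x \mathbf{w} \ge 0 \tag{2.2.29}$$

Proof. To prove (2.2.29), we define the dot product

$$\alpha \triangleq \mathbf{w}^H \mathbf{x} = \mathbf{x}^T \mathbf{w}^* = \sum_{k=1}^{M} w_k^* x_k \tag{2.2.30}$$

The mean square value of the random variable $\alpha$ is

$$E\{|\alpha|^2\} = E\{\mathbf{w}^H \mathbf{x} \mathbf{x}^H \mathbf{w}\} = \mathbf{w}^H E\{\mathbf{x} \mathbf{x}^H\} \mathbf{w} = \mathbf{w}^H \mathbf{R}_x \mathbf{w} \tag{2.2.31}$$

Since $E\{|\alpha|^2\} \ge 0$, it follows that $\mathbf{w}^H \mathbf{R}_x \mathbf{w} \ge 0$. We also note that a matrix is called positive definite (p.d.) if
$\mathbf{w}^H \mathbf{R}_x \mathbf{w} > 0$ for every nonzero $\mathbf{w}$.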


Eigenvalues and eigenvectors of R 


For a Hermitian matrix $\mathbf{R}$ we wish to find an $M \times 1$ vector $\mathbf{q}$ that satisfies the condition

$$\mathbf{R} \mathbf{q} = \lambda \mathbf{q} \tag{2.2.32}$$

where $\lambda$ is a constant. This condition implies that the linear transformation performed by matrix $\mathbf{R}$ does not
change the direction of vector $\mathbf{q}$. Thus $\mathbf{R}\mathbf{q}$ is a direction-invariant mapping. To determine the vector $\mathbf{q}$, we write
(2.2.32) as

$$(\mathbf{R} - \lambda \mathbf{I}) \mathbf{q} = \mathbf{0} \tag{2.2.33}$$

where $\mathbf{I}$ is the $M \times M$ identity matrix and $\mathbf{0}$ is an $M \times 1$ vector of zeros. Since we seek a nonzero $\mathbf{q}$, the only way
(2.2.33) is satisfied is if the determinant of $\mathbf{R} - \lambda \mathbf{I}$ equals zero, that is,

$$\det(\mathbf{R} - \lambda \mathbf{I}) = 0 \tag{2.2.34}$$

This equation is an $M$th-order polynomial in $\lambda$ and is called the characteristic equation of $\mathbf{R}$. It has $M$ roots
$\{\lambda_i\}_1^M$, called eigenvalues, which, in general, are distinct. If (2.2.34) has repeated roots, then $\mathbf{R}$ is said to have
degenerate eigenvalues. For each eigenvalue $\lambda_i$ we can satisfy (2.2.32)

$$\mathbf{R} \mathbf{q}_i = \lambda_i \mathbf{q}_i \qquad i = 1, \ldots, M \tag{2.2.35}$$

where the $\mathbf{q}_i$ are called eigenvectors of $\mathbf{R}$. Therefore, the $M \times M$ matrix $\mathbf{R}$ has $M$ eigenvectors. To
uniquely determine $\mathbf{q}_i$, we use (2.2.35) along with the normality condition $\|\mathbf{q}_i\| = 1$. The MATLAB function call
[Q,Lambda] = eig(R) computes the eigenvectors (columns of Q) and the eigenvalues (diagonal of Lambda) of $\mathbf{R}$.
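
For readers working outside MATLAB, the same computation can be sketched in Python with NumPy. The matrix below is a made-up Hermitian example, not one taken from the text. The routine `numpy.linalg.eigh` is intended for Hermitian matrices; it returns real eigenvalues in ascending order and unit-norm eigenvectors as the columns of the second output, so (2.2.35) and the normality condition can be checked directly.

```python
import numpy as np

# Hypothetical 3 x 3 Hermitian correlation matrix, used only for illustration
R = np.array([[2.0, 0.5 + 0.5j, 0.1],
              [0.5 - 0.5j, 2.0, 0.5 + 0.5j],
              [0.1, 0.5 - 0.5j, 2.0]])

# eigh assumes a Hermitian input: lam holds real eigenvalues (ascending),
# and the columns of Q hold the corresponding unit-norm eigenvectors
lam, Q = np.linalg.eigh(R)

for i in range(R.shape[0]):
    q = Q[:, i]
    residual = np.linalg.norm(R @ q - lam[i] * q)   # checks (2.2.35)
    print(lam[i], np.linalg.norm(q), residual)
```

Each printed row shows an eigenvalue, the norm of its eigenvector, and the size of the residual of (2.2.35), which should be at the level of rounding error.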

There are further properties of the autocorrelation matrix $\mathbf{R}$ based on its eigenanalysis, which we describe
below. Consider a matrix $\mathbf{R}$ that is Hermitian and nonnegative definite ($\mathbf{w}^H \mathbf{R} \mathbf{w} \ge 0$) with eigenvalues $\{\lambda_i\}_1^M$
and eigenvectors $\{\mathbf{q}_i\}_1^M$.


PROPERTY 2.2.3. The matrix $\mathbf{R}^k$ ($k = 1, 2, \ldots$) has eigenvalues $\lambda_1^k, \lambda_2^k, \ldots, \lambda_M^k$.

Proof. See Problem 2.16.

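PROPERTY 2.2.4. If the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ are distinct, the corresponding eigenvectors $\{\mathbf{q}_i\}_1^M$ are linearly independent.
Proof. This property can be proved by using Property 2.2.3. Given $M$ not-all-zero scalars $\{\alpha_i\}_1^M$, if

$$\sum_{i=1}^{M} \alpha_i \mathbf{q}_i = \mathbf{0} \tag{2.2.36}$$

then the eigenvectors $\{\mathbf{q}_i\}_1^M$ are said to be linearly dependent. Assume that (2.2.36) is true for some not-all-zero scalars $\{\alpha_i\}_1^M$
and that the eigenvalues $\{\lambda_i\}_1^M$ are distinct. Now multiply (2.2.36) repeatedly by $\mathbf{R}^k$, $k = 0, \ldots, M-1$, and use Property 2.2.3
to obtain

$$\sum_{i=1}^{M} \alpha_i \mathbf{R}^k \mathbf{q}_i = \sum_{i=1}^{M} \alpha_i \lambda_i^k \mathbf{q}_i = \mathbf{0} \qquad k = 0, \ldots, M-1 \tag{2.2.37}$$

which can be arranged in a matrix format for $i = 1, \ldots, M$ as

$$[\alpha_1 \mathbf{q}_1 \ \ \alpha_2 \mathbf{q}_2 \ \ \cdots \ \ \alpha_M \mathbf{q}_M]
\begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{M-1} \\
1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{M-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \lambda_M & \lambda_M^2 & \cdots & \lambda_M^{M-1}
\end{bmatrix} = \mathbf{0} \tag{2.2.38}$$

Since all the $\lambda_i$ are distinct, the matrix containing the $\lambda_i$ in (2.2.38) above is nonsingular. This matrix is called a Vandermonde
matrix. Therefore, postmultiplying both sides of (2.2.38) by the inverse of the Vandermonde matrix, we obtain

$$[\alpha_1 \mathbf{q}_1 \ \ \alpha_2 \mathbf{q}_2 \ \ \cdots \ \ \alpha_M \mathbf{q}_M] = \mathbf{0} \tag{2.2.39}$$

Since the eigenvectors $\{\mathbf{q}_i\}_1^M$ are not zero vectors, the only way (2.2.39) can be satisfied is if all $\{\alpha_i\}_1^M$ are zero. This implies that
(2.2.36) cannot be satisfied for any set of not-all-zero scalars $\{\alpha_i\}_1^M$, which further implies that $\{\mathbf{q}_i\}_1^M$ are linearly independent.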
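PROPERTY 2.2.5. The eigenvalues $\{\lambda_i\}_1^M$ are real and nonnegative.

Proof. From (2.2.35), we have

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \lambda_i \mathbf{q}_i^H \mathbf{q}_i \qquad i = 1, 2, \ldots, M \tag{2.2.40}$$

Since $\mathbf{R}$ is positive semidefinite, the quadratic form $\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i \ge 0$. Also, since $\mathbf{q}_i^H \mathbf{q}_i$ is an inner product, $\mathbf{q}_i^H \mathbf{q}_i > 0$. Hence

$$\lambda_i = \frac{\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i}{\mathbf{q}_i^H \mathbf{q}_i} \ge 0 \qquad i = 1, 2, \ldots, M \tag{2.2.41}$$

Furthermore, if $\mathbf{R}$ is positive definite, then $\lambda_i > 0$ for all $1 \le i \le M$. The quotient in (2.2.41) is a useful quantity and is known
as the Rayleigh quotient of vector $\mathbf{q}_i$.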


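PROPERTY 2.2.6. If the eigenvalues $\{\lambda_i\}_1^M$ are distinct, then the corresponding eigenvectors are orthogonal to one another, that is,

$$\lambda_i \ne \lambda_j \ \Rightarrow \ \mathbf{q}_i^H \mathbf{q}_j = 0 \qquad \text{for } i \ne j \tag{2.2.42}$$

Proof. Consider (2.2.35). We have

$$\mathbf{R} \mathbf{q}_i = \lambda_i \mathbf{q}_i \tag{2.2.43}$$

and

$$\mathbf{R} \mathbf{q}_j = \lambda_j \mathbf{q}_j \tag{2.2.44}$$

For some $i \ne j$, premultiplying both sides of (2.2.43) by $\mathbf{q}_j^H$, we obtain

$$\mathbf{q}_j^H \mathbf{R} \mathbf{q}_i = \mathbf{q}_j^H \lambda_i \mathbf{q}_i = \lambda_i \mathbf{q}_j^H \mathbf{q}_i \tag{2.2.45}$$

Taking the conjugate transpose of (2.2.44), using the Hermitian property (2.2.28) of $\mathbf{R}$, and using the realness Property 2.2.5 of
eigenvalues, we get

$$\mathbf{q}_j^H \mathbf{R} = \lambda_j \mathbf{q}_j^H \tag{2.2.46}$$

Now postmultiplying (2.2.46) by $\mathbf{q}_i$ and comparing with (2.2.45), we conclude that

$$\lambda_i \mathbf{q}_j^H \mathbf{q}_i = \lambda_j \mathbf{q}_j^H \mathbf{q}_i \qquad \text{or} \qquad (\lambda_i - \lambda_j) \mathbf{q}_j^H \mathbf{q}_i = 0 \tag{2.2.47}$$

Since the eigenvalues are assumed to be distinct, the only way (2.2.47) can be satisfied is if $\mathbf{q}_j^H \mathbf{q}_i = 0$ for $i \ne j$, which
proves that the corresponding eigenvectors are orthogonal to one another.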


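PROPERTY 2.2.7. Let $\{\mathbf{q}_i\}_1^M$ be an orthonormal set of eigenvectors corresponding to the distinct eigenvalues $\{\lambda_i\}_1^M$ of an
$M \times M$ correlation matrix $\mathbf{R}$. Then $\mathbf{R}$ can be diagonalized as follows:

$$\mathbf{\Lambda} = \mathbf{Q}^H \mathbf{R} \mathbf{Q} \tag{2.2.48}$$

where the orthonormal matrix $\mathbf{Q} \triangleq [\mathbf{q}_1 \ \cdots \ \mathbf{q}_M]$ is known as an eigenmatrix and $\mathbf{\Lambda}$ is an $M \times M$ diagonal eigenvalue matrix,
that is,

$$\mathbf{\Lambda} \triangleq \mathrm{diag}(\lambda_1, \ldots, \lambda_M) \tag{2.2.49}$$

Proof. Arranging the vectors in (2.2.35) in a matrix format, we obtain

$$[\mathbf{R}\mathbf{q}_1 \ \ \mathbf{R}\mathbf{q}_2 \ \ \cdots \ \ \mathbf{R}\mathbf{q}_M] = [\lambda_1 \mathbf{q}_1 \ \ \lambda_2 \mathbf{q}_2 \ \ \cdots \ \ \lambda_M \mathbf{q}_M]$$

which, by using the definitions of $\mathbf{Q}$ and $\mathbf{\Lambda}$, can be further expressed as

$$\mathbf{R} \mathbf{Q} = \mathbf{Q} \mathbf{\Lambda} \tag{2.2.50}$$

Since $\mathbf{q}_i$, $i = 1, \ldots, M$, is an orthonormal set of vectors, the eigenmatrix $\mathbf{Q}$ is unitary, that is, $\mathbf{Q}^{-1} = \mathbf{Q}^H$. Now premultiplying
both sides of (2.2.50) by $\mathbf{Q}^H$, we obtain the desired result.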


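This diagonalization of the autocorrelation matrix plays an important role in filtering and estimation theory, as
we shall see later. From (2.2.48) the correlation matrix $\mathbf{R}$ can also be written as

$$\mathbf{R} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^H = \lambda_1 \mathbf{q}_1 \mathbf{q}_1^H + \cdots + \lambda_M \mathbf{q}_M \mathbf{q}_M^H = \sum_{m=1}^{M} \lambda_m \mathbf{q}_m \mathbf{q}_m^H \tag{2.2.51}$$

which is known as the spectral theorem, or Mercer's theorem. If $\mathbf{R}$ is positive definite (and hence invertible), its
inverse is given by

$$\mathbf{R}^{-1} = (\mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^H)^{-1} = \mathbf{Q} \mathbf{\Lambda}^{-1} \mathbf{Q}^H = \sum_{m=1}^{M} \frac{1}{\lambda_m} \mathbf{q}_m \mathbf{q}_m^H \tag{2.2.52}$$

because $\mathbf{\Lambda}$ is a diagonal matrix.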
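
The diagonalization (2.2.48), the expansion (2.2.51), and the inverse (2.2.52) can be illustrated on a small matrix. The Python/NumPy sketch below uses a hypothetical $3 \times 3$ positive definite matrix chosen for the example; only the relations being checked come from the text.

```python
import numpy as np

# Hypothetical positive definite correlation matrix with entries 0.5^{|i-j|}
R = np.array([[1.0, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 1.0]])

lam, Q = np.linalg.eigh(R)            # columns of Q are orthonormal eigenvectors

# (2.2.48): Q^H R Q should equal the diagonal eigenvalue matrix
Lambda = Q.conj().T @ R @ Q

# (2.2.51): R as a weighted sum of rank-one terms q_m q_m^H
R_sum = sum(lam[m] * np.outer(Q[:, m], Q[:, m].conj()) for m in range(lam.size))

# (2.2.52): inverse built from reciprocal eigenvalues (needs every lam > 0)
R_inv = sum(np.outer(Q[:, m], Q[:, m].conj()) / lam[m] for m in range(lam.size))

print(np.allclose(Lambda, np.diag(lam)))
print(np.allclose(R_sum, R))
print(np.allclose(R_inv @ R, np.eye(3)))
```

When some eigenvalues are very small, the reciprocal weights in (2.2.52) become large, which is one way to see why a wide eigenvalue spread makes the inverse numerically sensitive.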


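PROPERTY 2.2.8. The trace of $\mathbf{R}$ is the summation of all eigenvalues, that is,

$$\mathrm{tr}(\mathbf{R}) = \sum_{i=1}^{M} \lambda_i \tag{2.2.53}$$

Proof. See Problem 2.17.

PROPERTY 2.2.9. The determinant of $\mathbf{R}$ is equal to the product of all eigenvalues, that is,

$$\det \mathbf{R} = |\mathbf{R}| = \prod_{i=1}^{M} \lambda_i = |\mathbf{\Lambda}| \tag{2.2.54}$$

Proof. See Problem 2.18.

PROPERTY 2.2.10. Determinants of $\mathbf{R}$ and $\mathbf{\Gamma}$ are related by

$$|\mathbf{R}| = |\mathbf{\Gamma}| (1 + \boldsymbol{\mu}^H \mathbf{\Gamma}^{-1} \boldsymbol{\mu}) \tag{2.2.55}$$

Proof. See Problem 2.19.


Here $\boldsymbol{\mu}$ is the mean vector and $\mathbf{\Gamma}_x$ is the autocovariance matrix, defined by

$$\mathbf{\Gamma}_x \triangleq E\{[\mathbf{x}(\zeta) - \boldsymbol{\mu}_x][\mathbf{x}(\zeta) - \boldsymbol{\mu}_x]^H\} =
\begin{bmatrix}
\gamma_{11} & \cdots & \gamma_{1M} \\
\vdots & \ddots & \vdots \\
\gamma_{M1} & \cdots & \gamma_{MM}
\end{bmatrix}$$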

2.2.5 Correlation Matrices from Random Processes 


A stochastic process can also be represented as a random vector, with its second-order statistics given by the mean
vector and the correlation matrix. Obviously, these quantities are functions of the index $n$. Let an $M \times 1$ random
vector $\mathbf{x}(n)$ be derived from the random process $x(n)$ as follows:

$$\mathbf{x}(n) \triangleq [x(n) \ \ x(n-1) \ \ \cdots \ \ x(n-M+1)]^T \tag{2.2.56}$$

Then its mean is given by an $M \times 1$ vector

$$\boldsymbol{\mu}_x(n) = [\mu_x(n) \ \ \mu_x(n-1) \ \ \cdots \ \ \mu_x(n-M+1)]^T \tag{2.2.57}$$

and the correlation by an $M \times M$ matrix

$$\mathbf{R}_x(n) =
\begin{bmatrix}
r_x(n, n) & \cdots & r_x(n, n-M+1) \\
\vdots & \ddots & \vdots \\
r_x(n-M+1, n) & \cdots & r_x(n-M+1, n-M+1)
\end{bmatrix} \tag{2.2.58}$$

Clearly, $\mathbf{R}_x(n)$ is Hermitian since $r_x(n-i, n-j) = r_x^*(n-j, n-i)$, $0 \le i, j \le M-1$. This vector representation will
be useful when we discuss optimum filters.


Correlation matrices of stationary processes 


The correlation matrix $\mathbf{R}_x(n)$ of a general stochastic process $x(n)$ is a Hermitian $M \times M$ matrix defined in
(2.2.58) with elements $r_x(n-i, n-j) = E\{x(n-i) x^*(n-j)\}$. For stationary processes this matrix has an interesting
additional structure. First, $\mathbf{R}_x(n)$ is a constant matrix $\mathbf{R}_x$; then using (2.1.24), we have

$$r_x(n-i, n-j) = r_x(j-i) \triangleq r_x(l) \qquad (l \triangleq j-i) \tag{2.2.59}$$

Finally, by using conjugate symmetry $r_x(l) = r_x^*(-l)$, the matrix $\mathbf{R}_x$ is given by

$$\mathbf{R}_x =
\begin{bmatrix}
r_x(0) & r_x(1) & r_x(2) & \cdots & r_x(M-1) \\
r_x^*(1) & r_x(0) & r_x(1) & \cdots & r_x(M-2) \\
r_x^*(2) & r_x^*(1) & r_x(0) & \cdots & r_x(M-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_x^*(M-1) & r_x^*(M-2) & r_x^*(M-3) & \cdots & r_x(0)
\end{bmatrix} \tag{2.2.60}$$

It can be easily seen that $\mathbf{R}_x$ is Hermitian and Toeplitz (see footnote 3). Thus, the autocorrelation matrix of a stationary process is
Hermitian, nonnegative definite, and Toeplitz.
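
The structure of (2.2.60) is easy to build from the lags $r_x(0), \ldots, r_x(M-1)$ alone. The following Python/NumPy sketch is an illustration under stated assumptions: the helper name `correlation_matrix` and the complex exponential-decay autocorrelation used as input are hypothetical. Entry $(i, j)$ is filled with $r_x(j-i)$, and negative lags are obtained from the conjugate symmetry $r_x(-l) = r_x^*(l)$.

```python
import numpy as np

def correlation_matrix(r):
    """Build the M x M matrix of (2.2.60) from r = [r(0), r(1), ..., r(M-1)]."""
    r = np.asarray(r, dtype=complex)
    M = r.size
    i, j = np.indices((M, M))
    lag = j - i
    # Entry (i, j) is r(j - i); negative lags use r(-l) = conj(r(l))
    return np.where(lag >= 0, r[np.abs(lag)], np.conj(r[np.abs(lag)]))

# Hypothetical lags r(l) = a^l, l >= 0, with a complex decay factor a
a = 0.7 * np.exp(1j * 0.4)
r = a ** np.arange(4)
R = correlation_matrix(r)

print(np.allclose(R, R.conj().T))            # Hermitian
print(np.allclose(R[1:, 1:], R[:-1, :-1]))   # Toeplitz: constant diagonals
print(np.linalg.eigvalsh(R).min())           # smallest eigenvalue
```

For this choice of lags the smallest eigenvalue is positive, consistent with the nonnegative definiteness stated above; an arbitrary sequence of numbers would not necessarily give a valid correlation matrix.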


Eigenvalue spread and spectral dynamic range 


The ill conditioning of a matrix $\mathbf{R}_x$ increases with its condition number $\mathcal{X}(\mathbf{R}_x) = \lambda_{\max} / \lambda_{\min}$. When $\mathbf{R}_x$ is a
correlation matrix of a stationary process, then $\mathcal{X}(\mathbf{R}_x)$ is bounded from above by the dynamic range of the PSD
$R_x(e^{j\omega})$ of the process $x(n)$. The larger the spread in eigenvalues, the wider (or less flat) the variation of the PSD
function. This is also related to the dynamic range or to the data spread in $x(n)$ and is a useful measure in practice.
This result is given by the following theorem, in which we have dropped the subscript of $R_x(e^{j\omega})$ for clarity.

Footnote 3. A matrix is called Toeplitz if the elements along each diagonal, parallel to the main diagonal, are equal.


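THEOREM 2.4. Consider a zero-mean stationary random process with autoPSD

$$R(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r(l) e^{-j\omega l}$$

then

$$\min_{\omega} R(e^{j\omega}) \le \lambda_i \le \max_{\omega} R(e^{j\omega}) \qquad \text{for all } i = 1, 2, \ldots, M \tag{2.2.61}$$

Proof. From (2.2.41) we have

$$\lambda_i = \frac{\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i}{\mathbf{q}_i^H \mathbf{q}_i} \tag{2.2.62}$$

Consider the quadratic form

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \sum_{k=1}^{M} \sum_{l=1}^{M} q_i^*(k) \, r(l-k) \, q_i(l)$$

where $\mathbf{q}_i = [q_i(1) \ q_i(2) \ \cdots \ q_i(M)]^T$. Using (2.1.41) and the stationarity of the process, we obtain

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \frac{1}{2\pi} \sum_{k} \sum_{l} q_i^*(k) q_i(l) \int_{-\pi}^{\pi} R(e^{j\omega}) e^{j\omega(l-k)} \, d\omega$$

$$= \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\omega}) \left[ \sum_{k=1}^{M} q_i^*(k) e^{-j\omega k} \right] \left[ \sum_{l=1}^{M} q_i(l) e^{j\omega l} \right] d\omega \tag{2.2.63}$$

or

$$\mathbf{q}_i^H \mathbf{R} \mathbf{q}_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\omega}) |Q(e^{j\omega})|^2 \, d\omega \tag{2.2.64}$$

where $Q(e^{j\omega}) \triangleq \sum_{l=1}^{M} q_i(l) e^{j\omega l}$. Similarly, we have

$$\mathbf{q}_i^H \mathbf{q}_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2 \, d\omega \tag{2.2.65}$$

Substituting (2.2.64) and (2.2.65) in (2.2.62), we obtain

$$\lambda_i = \frac{\int_{-\pi}^{\pi} |Q(e^{j\omega})|^2 R(e^{j\omega}) \, d\omega}{\int_{-\pi}^{\pi} |Q(e^{j\omega})|^2 \, d\omega} \tag{2.2.66}$$

However, since $R(e^{j\omega}) \ge 0$, we have the following inequality:

$$\min_{\omega} R(e^{j\omega}) \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2 \, d\omega \le \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2 R(e^{j\omega}) \, d\omega \le \max_{\omega} R(e^{j\omega}) \int_{-\pi}^{\pi} |Q(e^{j\omega})|^2 \, d\omega$$

from which we easily obtain the desired result. The above result also implies that

$$\mathcal{X}(\mathbf{R}) \triangleq \frac{\lambda_{\max}}{\lambda_{\min}} \le \frac{\max_{\omega} R(e^{j\omega})}{\min_{\omega} R(e^{j\omega})} \tag{2.2.67}$$

which becomes equality as $M \to \infty$.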
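
The bounds (2.2.61) and (2.2.67) can be observed numerically. The Python/NumPy sketch below is a hedged illustration: it assumes the autocorrelation $r(l) = a^{|l|}$ of Example 2.1.7, whose PSD is known in closed form, and the values of `a`, `M`, and the frequency grid are arbitrary choices. It compares the extreme eigenvalues of the $M \times M$ Toeplitz correlation matrix with the minimum and maximum of the PSD.

```python
import numpy as np

a, M = 0.8, 16                          # hypothetical model r(l) = a^{|l|}, matrix size M
idx = np.arange(M)
R = a ** np.abs(idx[:, None] - idx[None, :])     # real symmetric Toeplitz matrix

lam = np.linalg.eigvalsh(R)             # real eigenvalues in ascending order

# Closed-form PSD of r(l) = a^{|l|}, sampled on a dense frequency grid
omega = np.linspace(-np.pi, np.pi, 4001)
psd = (1 - a**2) / (1 + a**2 - 2 * a * np.cos(omega))

print(psd.min(), lam[0], lam[-1], psd.max())     # ordering claimed by (2.2.61)
print(lam[-1] / lam[0], psd.max() / psd.min())   # the two sides of (2.2.67)
```

Increasing `M` in this sketch moves the extreme eigenvalues closer to the PSD extremes, in line with the remark that (2.2.67) becomes an equality as $M \to \infty$.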


2.3 Innovations Representation of Random Vectors 


In many practical and theoretical applications, it is desirable to represent a random vector (or sequence) with a
linearly equivalent vector (or sequence) consisting of uncorrelated components. If $\mathbf{x}$ is a correlated random vector
and if $\mathbf{A}$ is a nonsingular matrix, then the linear transformation

$$\mathbf{w} = \mathbf{A} \mathbf{x} \tag{2.3.1}$$

results in a random vector $\mathbf{w}$ that contains the same "information" as $\mathbf{x}$, and hence random vectors $\mathbf{x}$ and $\mathbf{w}$
are said to be linearly equivalent. Furthermore, if $\mathbf{w}$ is an uncorrelated random vector, then each component $w_i$ of
$\mathbf{w}$ can be thought of as adding "new" information (or innovation) to $\mathbf{w}$ that is not present in the remaining
components. Such a representation is called an innovations representation and provides additional insight into the
understanding of random vectors and sequences. Additionally, it can simplify many theoretical derivations and can
result in computationally efficient implementations.

Since $\mathbf{\Gamma}_w$ must be a diagonal matrix, we need to diagonalize the Hermitian, positive definite matrix $\mathbf{\Gamma}_x$
through the transformation matrix $\mathbf{A}$. There are two approaches to this diagonalization. One approach is to use the
eigenanalysis presented in Section 2.2.4, which results in the well-known Karhunen-Loève (KL) transform. The other
approach is to use triangularization methods from linear algebra, which leads to the LDU (UDL) and LU (UL)
decompositions. These vector techniques can be further extended to random sequences, which gives us the KL expansion
and the spectral factorizations, respectively.

Here, only the transformations using eigendecomposition are discussed.


Transformations Using Eigendecomposition 
Let $\mathbf{x}$ be a random vector with mean vector $\boldsymbol{\mu}_x$ and covariance matrix $\mathbf{\Gamma}_x$. The linear transformation

$$\mathbf{x}_0 = \mathbf{x} - \boldsymbol{\mu}_x \tag{2.3.2}$$

results in a zero-mean vector $\mathbf{x}_0$ with correlation (and covariance) matrix equal to $\mathbf{\Gamma}_x$. This transformation shifts
the origin of the $M$-dimensional coordinate system to the mean vector. We will now consider the zero-mean
random vector $\mathbf{x}_0$ for further transformations.


Orthonormal transformation 


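Let $\mathbf{Q}_x$ be the eigenmatrix of $\mathbf{\Gamma}_x$, and let us choose $\mathbf{Q}_x^H$ as our linear transformation matrix $\mathbf{A}$. Consider

$$\mathbf{w} = \mathbf{Q}_x^H \mathbf{x}_0 = \mathbf{Q}_x^H (\mathbf{x} - \boldsymbol{\mu}_x) \tag{2.3.3}$$

Then

$$\boldsymbol{\mu}_w = \mathbf{Q}_x^H E\{\mathbf{x}_0\} = \mathbf{0} \tag{2.3.4}$$

and

$$\mathbf{\Gamma}_w = \mathbf{R}_w = E\{\mathbf{Q}_x^H \mathbf{x}_0 \mathbf{x}_0^H \mathbf{Q}_x\} = \mathbf{Q}_x^H \mathbf{\Gamma}_x \mathbf{Q}_x = \mathbf{\Lambda}_x \tag{2.3.5}$$

Since $\mathbf{\Lambda}_x$ is diagonal, $\mathbf{\Gamma}_w$ is also diagonal, and hence this transformation has some interesting properties:

1. The random vector $\mathbf{w}$ has zero mean, and its components are mutually uncorrelated (and hence orthogonal).
Furthermore, if $\mathbf{x}$ is $N(\boldsymbol{\mu}_x, \mathbf{\Gamma}_x)$, then $\mathbf{w}$ is $N(\mathbf{0}, \mathbf{\Lambda}_x)$ with independent components.

2. The variances of the random variables $w_i$, $i = 1, \ldots, M$, are equal to the eigenvalues of $\mathbf{\Gamma}_x$.

3. Since the transformation matrix $\mathbf{A} = \mathbf{Q}_x^H$ is orthonormal, the transformation is called an orthonormal
transformation and the distance measure

$$d^2(\mathbf{x}_0) \triangleq \mathbf{x}_0^H \mathbf{\Gamma}_x^{-1} \mathbf{x}_0 \tag{2.3.6}$$

is preserved under the transformation. This distance measure is also known as the Mahalanobis distance; and in the
case of normal random vectors, it is related to the log-likelihood function.

4. Since $\mathbf{w} = \mathbf{Q}_x^H (\mathbf{x} - \boldsymbol{\mu}_x)$, we have

$$w_i = \mathbf{q}_i^H (\mathbf{x} - \boldsymbol{\mu}_x) = \|\mathbf{x} - \boldsymbol{\mu}_x\| \cos[\angle(\mathbf{x} - \boldsymbol{\mu}_x, \mathbf{q}_i)] \qquad i = 1, \ldots, M \tag{2.3.7}$$

which is the projection of $\mathbf{x} - \boldsymbol{\mu}_x$ onto the unit vector $\mathbf{q}_i$. Thus $\mathbf{w}$ represents $\mathbf{x}$ in a new coordinate system
that is shifted to $\boldsymbol{\mu}_x$ and spanned by $\mathbf{q}_i$, $i = 1, \ldots, M$. A geometric interpretation of this transformation for a
two-dimensional case is shown in Figure 2.5, which shows a contour of
$d^2(\mathbf{x}_0) = \mathbf{x}_0^H \mathbf{\Gamma}_x^{-1} \mathbf{x}_0 = \mathbf{w}^H \mathbf{\Lambda}_x^{-1} \mathbf{w}$ in the $\mathbf{x}$ and $\mathbf{w}$ coordinate systems ($\mathbf{w} = \mathbf{Q}_x^H \mathbf{x}_0$).


FIGURE 2.5
Orthogonal transformation in two dimensions.


Isotropic transformation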


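In the above orthonormal transformation, the autocorrelation matrix $\mathbf{R}_w$ is diagonal but not an identity matrix
$\mathbf{I}$. This can be achieved by an additional linear mapping of $\mathbf{\Lambda}_x^{-1/2}$. Let

$$\mathbf{y} = \mathbf{\Lambda}_x^{-1/2} \mathbf{w} = \mathbf{\Lambda}_x^{-1/2} \mathbf{Q}_x^H \mathbf{x}_0 = \mathbf{\Lambda}_x^{-1/2} \mathbf{Q}_x^H (\mathbf{x} - \boldsymbol{\mu}_x) \tag{2.3.8}$$

Then

$$\mathbf{R}_y = \mathbf{\Lambda}_x^{-1/2} \mathbf{Q}_x^H \mathbf{\Gamma}_x \mathbf{Q}_x \mathbf{\Lambda}_x^{-1/2} = \mathbf{\Lambda}_x^{-1/2} \mathbf{\Lambda}_x \mathbf{\Lambda}_x^{-1/2} = \mathbf{I} \tag{2.3.9}$$

This is called an isotropic transformation because all components of $\mathbf{y}$ are zero-mean, uncorrelated random
variables with unit variance. The geometric interpretation of this transformation for a two-dimensional case is
shown in Figure 2.6. It clearly shows that there is not only a shift and rotation but also a scaling of the coordinate axes
so that the distribution is equal in all directions, that is, it is direction-invariant. Because the transformation
$\mathbf{A} = \mathbf{\Lambda}_x^{-1/2} \mathbf{Q}_x^H$ is orthogonal but not orthonormal, the distance measure $d^2(\mathbf{x}_0)$ is not preserved under this mapping.
Since the correlation matrix after this transformation is an identity matrix $\mathbf{I}$, it is invariant under any orthonormal
mapping, that is,

$$\mathbf{Q}^H \mathbf{I} \mathbf{Q} = \mathbf{Q}^H \mathbf{Q} = \mathbf{I} \tag{2.3.10}$$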


This fact can be used for simultaneous diagonalization of two Hermitian matrices. 
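
Both transformations can be tried on simulated data. The sketch below, in Python with NumPy, is a minimal illustration and not a prescribed implementation: the mean vector, the $2 \times 2$ covariance matrix, the Gaussian assumption, and the sample size are all hypothetical. It applies (2.3.2), (2.3.3), and (2.3.8) to the samples and estimates the resulting covariance matrices, which should be close to $\mathbf{\Lambda}_x$ and $\mathbf{I}$, respectively.

```python
import numpy as np

rng = np.random.default_rng(1)
a = 0.6
mu_x = np.array([1.0, -2.0])                  # hypothetical mean vector
Gamma_x = np.array([[1.0, a], [a, 1.0]])      # hypothetical covariance matrix

# Correlated Gaussian samples, one random vector per row
x = rng.multivariate_normal(mu_x, Gamma_x, size=100_000)

lam, Q = np.linalg.eigh(Gamma_x)              # Gamma_x = Q diag(lam) Q^H

x0 = x - mu_x                                 # (2.3.2): remove the mean
w = x0 @ Q.conj()                             # (2.3.3): each row holds (Q^H x0)^T
y = w / np.sqrt(lam)                          # (2.3.8): scale by Lambda^{-1/2}

cov_w = w.T @ w.conj() / len(w)               # sample estimate of Lambda_x
cov_y = y.T @ y.conj() / len(y)               # sample estimate of I
print(np.round(cov_w, 2))
print(np.round(cov_y, 2))
```

In this sketch the eigendecomposition uses the known covariance matrix; with real data one would typically have to estimate it from the samples first, which adds estimation error to both transformations.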


FIGURE 2.6
Isotropic transformation in two dimensions.


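EXAMPLE 2.3.1. Consider a stationary sequence with correlation matrix

$$\mathbf{R}_x = \begin{bmatrix} 1 & a \\ a & 1 \end{bmatrix}$$

where $-1 < a < 1$. The eigenvalues

$$\lambda_1 = 1 + a \qquad \lambda_2 = 1 - a$$

are obtained from the characteristic equation

$$\det(\mathbf{R}_x - \lambda \mathbf{I}) = \det \begin{bmatrix} 1 - \lambda & a \\ a & 1 - \lambda \end{bmatrix} = (1 - \lambda)^2 - a^2 = 0$$

To find the eigenvector $\mathbf{q}_1$, we solve the linear system

$$\begin{bmatrix} 1 & a \\ a & 1 \end{bmatrix} \begin{bmatrix} q_1^{(1)} \\ q_2^{(1)} \end{bmatrix} = (1 + a) \begin{bmatrix} q_1^{(1)} \\ q_2^{(1)} \end{bmatrix}$$

which gives $q_1^{(1)} = q_2^{(1)}$. Similarly, we find that $q_1^{(2)} = -q_2^{(2)}$. If we normalize both vectors to unit length, we obtain the
eigenvectors

$$\mathbf{q}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \mathbf{q}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

From the above results we see that $\det \mathbf{R}_x = 1 - a^2 = \lambda_1 \lambda_2$ and $\mathbf{Q}^H \mathbf{Q} = \mathbf{I}$, where $\mathbf{Q} = [\mathbf{q}_1 \ \mathbf{q}_2]$.

Footnote 4. In the literature, an isotropic transformation is also known as a whitening transformation. We believe that this terminology is not accurate
because both vectors $\mathbf{Q}_x^H \mathbf{x}_0$ and $\mathbf{\Lambda}_x^{-1/2} \mathbf{Q}_x^H \mathbf{x}_0$ have uncorrelated coefficients.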


2.4 Principles of Estimation Theory 


The key assumption underlying our discussion up to this point was that the probability distributions associated with 
the problem under consideration were known. As a result, all required probabilities, autocorrelation sequences, and 
PSD functions either could be derived from a set of assumptions about the involved random processes or were given 
a priori. However, in most practical applications, this is the exception rather than the rule. Therefore, the properties 
and parameters of random variables and random processes should be obtained by collecting and analyzing finite sets 
of measurements. In this section, we introduce some basic concepts of estimation theory that will be used repeatedly 
in the rest of the book. Complete treatments of estimation theory can be found in Kay (1993), Helstrom (1995), Van 
Trees (1968), and Papoulis (1991). 


2.4.1 Properties of Estimators 


Suppose that we collect $N$ observations $\{x(n)\}_0^{N-1}$ from a stationary stochastic process and use them to estimate a
parameter $\theta$ (which we assume to be real-valued) of the process using some function $\hat{\theta}[\{x(n)\}_0^{N-1}]$. The same
results can be used for a set of measurements $\{x_k(n)\}_1^{N}$ obtained from $N$ sensors sampling stochastic processes
with the same distributions. The function $\hat{\theta}[\{x(n)\}_0^{N-1}]$ is known as an estimator whereas the value taken by the
estimator, using a particular set of observations, is called a point estimate or simply an estimate. The intention of the
estimator design is that the estimate should be as close to the true value of the parameter as possible. However, if we
use another set of observations or a different number of observations from the same set, it is highly unlikely that we
will obtain the same estimate. As an example of an estimator, consider estimating the mean $\mu_x$ of a stationary
process $x(n)$ from its $N$ observations $\{x(n)\}_0^{N-1}$. Then the natural estimator is a simple arithmetic average of
these observations, given by

$$\hat{\mu}_x = \hat{\theta}[\{x(n)\}_0^{N-1}] = \frac{1}{N}\sum_{n=0}^{N-1} x(n) \tag{2.4.1}$$

Similarly, a natural estimator of the variance $\sigma_x^2$ of the process $x(n)$ would be

$$\hat{\sigma}_x^2 = \hat{\theta}[\{x(n)\}_0^{N-1}] = \frac{1}{N}\sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{2.4.2}$$

If we repeat this procedure a large number of times, we will obtain a large number of estimates, which can be
used to generate a histogram showing the distribution of the estimates. Before the collection of observations, we
would like to describe all sets of data that can be obtained by using the random variables $\{x(n,\zeta)\}_0^{N-1}$. The obtained
set of $N$ observations $\{x(n)\}_0^{N-1}$ can thus be regarded as one realization of the random variables $\{x(n,\zeta)\}_0^{N-1}$
defined on an $N$-dimensional sample space. In this sense, the estimator $\hat{\theta}[\{x(n,\zeta)\}_0^{N-1}]$ becomes a random
variable whose distribution can be obtained from the joint distribution of the random variables $\{x(n,\zeta)\}_0^{N-1}$. This
distribution is called the sampling distribution of the estimator and is a fundamental concept in estimation theory
because it provides all the information we need to evaluate the quality of an estimator.
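The idea of a sampling distribution can be illustrated by repeating the estimation over many independent data sets. The sketch below is an illustration under stated assumptions rather than a procedure from the text: it assumes IID Gaussian observations with zero mean and unit variance, N = 25 observations per trial, and 2000 trials; the function names are hypothetical.

```python
import random
import statistics

def sample_mean(x):
    """Arithmetic average, as in (2.4.1)."""
    return sum(x) / len(x)

def sample_variance(x):
    """Variance estimator of (2.4.2); note the division by N, not N - 1."""
    m = sample_mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def sampling_distribution(estimator, n_obs, n_trials, rng):
    """Apply the estimator to n_trials independent sets of n_obs
    standard normal observations and collect the estimates."""
    return [estimator([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
            for _ in range(n_trials)]

rng = random.Random(1)  # fixed seed so the run is repeatable
means = sampling_distribution(sample_mean, 25, 2000, rng)
variances = sampling_distribution(sample_variance, 25, 2000, rng)

# Summary of the two empirical sampling distributions.
print("mean estimator: centre", statistics.fmean(means),
      "spread", statistics.pstdev(means))
print("variance estimator: centre", statistics.fmean(variances))
```

Under these assumptions the mean estimates should cluster around 0 with a standard deviation near 1/sqrt(25) = 0.2, and the variance estimates should centre slightly below 1, which anticipates the bias discussed next.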

The sampling distribution of a “good” estimator should be concentrated as closely as possible about the 
parameter that it estimates. To determine how “good” an estimator is and how different estimators of the same 
parameter compare with one another, we need to determine their sampling distributions. Since it is not always 
possible to derive the exact sampling distributions, we have to resort to properties that use the lower-order moments 
(mean, variance, mean square error) of the estimator. 

Bias of estimator. The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as

$$B(\hat{\theta}) \triangleq E[\hat{\theta}] - \theta \tag{2.4.3}$$

while the normalized bias is defined as

$$\epsilon_b \triangleq \frac{B(\hat{\theta})}{\theta} \qquad \theta \neq 0 \tag{2.4.4}$$

When $B(\hat{\theta}) = 0$, the estimator is said to be unbiased and the pdf of the estimator is centered exactly at the true value
$\theta$. Generally, one should select estimators that are unbiased such as the mean estimator in (2.4.1) or very nearly
unbiased such as the variance estimator in (2.4.2). However, it is not always wise to select an unbiased estimator, as
we will see below and in Section 4.2 on the estimation of autocorrelation sequences.

Variance of estimator. The variance of the estimator $\hat{\theta}$ is defined by

$$\text{var}(\hat{\theta}) = \sigma_{\hat{\theta}}^2 \triangleq E\{|\hat{\theta} - E\{\hat{\theta}\}|^2\} \tag{2.4.5}$$

which measures the spread of the pdf of $\hat{\theta}$ around its average value. Therefore, one would select an estimator with
the smallest variance. However, this selection is not always compatible with the small bias requirement. As we will
see below, reducing variance may result in an increase in bias. Therefore, a balance between these two conflicting
requirements is required, which is provided by the mean square error property. The normalized standard deviation
(also called the coefficient of variation) is defined by

$$\epsilon_r \triangleq \frac{\sigma_{\hat{\theta}}}{\theta} \qquad \theta \neq 0 \tag{2.4.6}$$

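Mean square error. The mean square error (MSE) of the estimator is given by

$$\text{MSE}(\hat{\theta}) = E\{|\hat{\theta} - \theta|^2\} = \sigma_{\hat{\theta}}^2 + |B|^2 \tag{2.4.7}$$

Indeed, we have

$$\text{MSE}(\hat{\theta}) = E\{|\theta - E\{\hat{\theta}\} - (\hat{\theta} - E\{\hat{\theta}\})|^2\}$$

$$= E\{|\theta - E\{\hat{\theta}\}|^2\} + E\{|\hat{\theta} - E\{\hat{\theta}\}|^2\} - (\theta - E\{\hat{\theta}\})E\{(\hat{\theta} - E\{\hat{\theta}\})^*\} - (\theta - E\{\hat{\theta}\})^*E\{\hat{\theta} - E\{\hat{\theta}\}\} \tag{2.4.8}$$

$$= |\theta - E\{\hat{\theta}\}|^2 + E\{|\hat{\theta} - E\{\hat{\theta}\}|^2\} \tag{2.4.9}$$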


which leads to (2.4.7) by using (2.4.3) and (2.4.5). Ideally, we would like to minimize the MSE, but this minimum is 
not always zero. Hence minimizing variance can increase the bias. The normalized MSE is defined as 
$$\epsilon \triangleq \frac{\text{MSE}(\hat{\theta})}{\theta} \qquad \theta \neq 0 \tag{2.4.10}$$
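The decomposition of the MSE into variance plus squared bias in (2.4.7) can be observed on simulated estimates. The sketch below is illustrative only: it assumes IID Gaussian data with true variance 4, a small sample size N = 10 so that the bias of the divide-by-N variance estimator is visible, and 5000 trials. The names are hypothetical.

```python
import random
import statistics

def biased_variance(x):
    """Variance estimator that divides by N, as in (2.4.2)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

rng = random.Random(7)   # fixed seed for repeatability
theta = 4.0              # true variance (standard deviation 2)
n_obs, n_trials = 10, 5000

estimates = [biased_variance([rng.gauss(0.0, 2.0) for _ in range(n_obs)])
             for _ in range(n_trials)]

# Empirical counterparts of (2.4.3), (2.4.5), and (2.4.7).
bias = statistics.fmean(estimates) - theta
variance = statistics.pvariance(estimates)
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])

print("bias", bias)                      # expected to be near -theta/N = -0.4
print("variance + bias^2", variance + bias ** 2)
print("mse", mse)
```

When the variance is computed with the population formula over the same set of estimates, the identity holds exactly for the empirical moments (up to floating-point rounding), which mirrors the algebra in (2.4.8) and (2.4.9).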


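Cramér-Rao lower bound. If it is possible to minimize the MSE when the bias is zero, then clearly the variance
is also minimized. Such estimators are called minimum variance unbiased estimators, and they attain an important
minimum bound on the variance of the estimator, called the Cramér-Rao lower bound (CRLB), or minimum variance
bound. If $\hat{\theta}$ is unbiased, then it follows that $E\{\hat{\theta} - \theta\} = 0$, which may be expressed as

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (\hat{\theta} - \theta)\, f_{\mathbf{x};\theta}(\mathbf{x};\theta)\, d\mathbf{x} = 0 \tag{2.4.11}$$

where $\mathbf{x}(\zeta) = [x_1(\zeta), x_2(\zeta), \ldots, x_N(\zeta)]^T$ and $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is the joint density of $\mathbf{x}(\zeta)$, which depends on a fixed
but unknown parameter $\theta$. If we differentiate (2.4.11) with respect to $\theta$, assuming real-valued $\hat{\theta}$, we obtain

$$0 = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta}\left[(\hat{\theta} - \theta)\, f_{\mathbf{x};\theta}(\mathbf{x};\theta)\right] d\mathbf{x} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (\hat{\theta} - \theta)\, \frac{\partial f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta}\, d\mathbf{x} - \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{\mathbf{x};\theta}(\mathbf{x};\theta)\, d\mathbf{x}}_{=1} \tag{2.4.12}$$

Using the fact

$$\frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta} = \frac{1}{f_{\mathbf{x};\theta}(\mathbf{x};\theta)}\, \frac{\partial f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} \quad\text{or}\quad \frac{\partial f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} = \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta}\, f_{\mathbf{x};\theta}(\mathbf{x};\theta) \tag{2.4.13}$$

and substituting (2.4.13) in (2.4.12), we get

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left\{ (\hat{\theta} - \theta)\, \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta} \right\} f_{\mathbf{x};\theta}(\mathbf{x};\theta)\, d\mathbf{x} = 1 \tag{2.4.14}$$

Clearly, the left side of (2.4.14) is simply the expectation of the expression inside the brackets, that is,

$$E\left\{ (\hat{\theta} - \theta)\, \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta} \right\} = 1 \tag{2.4.15}$$

Using the Cauchy-Schwarz inequality (Papoulis 1991; Stark and Woods 1994)
$|E\{x(\zeta)y(\zeta)\}|^2 \leq E\{|x(\zeta)|^2\}\,E\{|y(\zeta)|^2\}$, we obtain

$$E\{(\hat{\theta} - \theta)^2\}\, E\left\{ \left( \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta} \right)^2 \right\} \geq E^2\left\{ (\hat{\theta} - \theta)\, \frac{\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]}{\partial\theta} \right\} = 1 \tag{2.4.16}$$

The first term on the left-hand side is the variance of the estimator $\hat{\theta}$ since it is unbiased. Hence

$$\text{var}(\hat{\theta}) \geq \frac{1}{E\{(\partial \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]/\partial\theta)^2\}} \tag{2.4.17}$$

which is one form of the CRLB and can also be expressed as

$$\text{var}(\hat{\theta}) \geq -\frac{1}{E\{\partial^2 \ln[f_{\mathbf{x};\theta}(\mathbf{x};\theta)]/\partial\theta^2\}} \tag{2.4.18}$$

The function $\ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ is called the log likelihood function of $\theta$. The CRLB expresses the minimum error
variance of any unbiased estimator $\hat{\theta}$ of $\theta$ in terms of the joint density $f_{\mathbf{x};\theta}(\mathbf{x};\theta)$ of observations. Hence every unbiased
estimator must have a variance greater than or equal to a certain number. An unbiased estimate that satisfies the CRLB (2.4.18)
with equality is called an efficient estimate. If such an estimate exists, then it can be obtained as a unique solution to
the likelihood equation

$$\frac{\partial \ln f_{\mathbf{x};\theta}(\mathbf{x};\theta)}{\partial\theta} = 0 \tag{2.4.19}$$

The solution of (2.4.19) is called the maximum likelihood (ML) estimate. Note that if the efficient estimate does not
exist, then the ML estimate will not achieve the lower bound and hence it is difficult to ascertain how closely the
variance of any estimate will approach the bound. The CRLB can be generalized to handle the estimation of vector
parameters (Therrien 1992).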
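As a worked illustration of the bound (not taken from the text), consider the standard case of $N$ IID observations $x(n) \sim N(\mu_x, \sigma_x^2)$ with known variance, where the parameter to be estimated is the mean $\mu_x$. Under these assumptions the quantities in (2.4.18) and (2.4.19) can be written out explicitly:

```latex
\begin{align*}
\ln f_{\mathbf{x};\mu_x}(\mathbf{x};\mu_x)
  &= -\frac{N}{2}\ln(2\pi\sigma_x^2)
     - \frac{1}{2\sigma_x^2}\sum_{n=0}^{N-1}[x(n)-\mu_x]^2 \\
\frac{\partial \ln f_{\mathbf{x};\mu_x}(\mathbf{x};\mu_x)}{\partial \mu_x}
  &= \frac{1}{\sigma_x^2}\sum_{n=0}^{N-1}[x(n)-\mu_x]
   = \frac{N}{\sigma_x^2}\left(\frac{1}{N}\sum_{n=0}^{N-1}x(n) - \mu_x\right) \\
\frac{\partial^2 \ln f_{\mathbf{x};\mu_x}(\mathbf{x};\mu_x)}{\partial \mu_x^2}
  &= -\frac{N}{\sigma_x^2} \\
\operatorname{var}(\hat{\mu}_x)
  &\ge -\frac{1}{E\{\partial^2 \ln f_{\mathbf{x};\mu_x}(\mathbf{x};\mu_x)/\partial \mu_x^2\}}
   = \frac{\sigma_x^2}{N}
\end{align*}
```

Setting the first derivative to zero, as in the likelihood equation, gives the arithmetic average of the observations as the ML estimate. For IID data its variance is $\sigma_x^2/N$, so in this particular case the bound is met with equality and the estimate is efficient.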

Consistency of estimator. If the MSE of the estimator can be made to approach zero as the sample size $N$
becomes large, then from (2.4.7) both the bias and the variance will tend to zero. Then the sampling distribution will
tend to concentrate about $\theta$, and eventually as $N \to \infty$, the sampling distribution will become an impulse at $\theta$.
This is an important and desirable property, and the estimator that possesses it is called a consistent estimator. 

Confidence interval. If we know the sampling distribution of an estimator, we can use the observations to 
compute an interval that has a specified probability of covering the unknown true parameter value. This interval is 
called a confidence interval, and the coverage probability is called the confidence level. When we interpret the 
meaning of confidence intervals, it is important to remember that it is the interval that is the random variable, and not 
the parameter. This concept will be explained in the sequel by means of specific examples. 


2.4.2 Estimation of Mean 


The natural estimator of the mean $\mu_x$ of a stationary sequence $x(n)$ from the observations $\{x(n)\}_0^{N-1}$ is the
sample mean, given by

$$\hat{\mu}_x = \frac{1}{N}\sum_{n=0}^{N-1} x(n) \tag{2.4.20}$$

The estimate $\hat{\mu}_x$ is a random variable that depends on the number and values of the observations. Changing $N$ or
the set of observations will lead to another value for $\hat{\mu}_x$. Since the mean of the estimator is given by

$$E\{\hat{\mu}_x\} = \mu_x \tag{2.4.21}$$

the estimator $\hat{\mu}_x$ is unbiased. If $x(n) \sim WN(\mu_x, \sigma_x^2)$, we have

$$\text{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N} \tag{2.4.22}$$

because the samples of the process are uncorrelated random variables. This variance, which is a measure of the
estimator's quality, increases if $x(n)$ is nonwhite.
Indeed, for a correlated random sequence, the variance of $\hat{\mu}_x$ is given by (see Problem 2.30)

$$\text{var}(\hat{\mu}_x) = N^{-1}\sum_{l=-N}^{N}\left(1 - \frac{|l|}{N}\right)\gamma_x(l) \leq N^{-1}\sum_{l=-N}^{N}|\gamma_x(l)| \tag{2.4.23}$$

where $\gamma_x(l)$ is the covariance sequence of $x(n)$. If $\gamma_x(l) \to 0$ as $l \to \infty$, then $\text{var}(\hat{\mu}_x) \to 0$
as $N \to \infty$ and hence $\hat{\mu}_x$ is a consistent estimator of $\mu_x$. If $\sum_{l=-\infty}^{\infty}|\gamma_x(l)| < \infty$, then from (2.4.23)

$$\lim_{N\to\infty} N\,\text{var}(\hat{\mu}_x) = \lim_{N\to\infty}\sum_{l=-N}^{N}\left(1 - \frac{|l|}{N}\right)\gamma_x(l) = \sum_{l=-\infty}^{\infty}\gamma_x(l) \tag{2.4.24}$$

The expression for $\text{var}(\hat{\mu}_x)$ in (2.4.23) can also be put in the form (see Problem 2.30)

$$\text{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N}\,[1 + \Delta_N(\rho_x)] \tag{2.4.25}$$

where

$$\Delta_N(\rho_x) = 2\sum_{l=1}^{N}\left(1 - \frac{l}{N}\right)\rho_x(l) \qquad \rho_x(l) = \frac{\gamma_x(l)}{\sigma_x^2} \tag{2.4.26}$$

Since $\Delta_N(\rho_x) \geq 0$, the variance of the estimator increases as the amount of correlation among the samples of $x(n)$
increases. This implies that as the correlation increases, we need more samples to retain the quality of the estimate
because each additional sample carries "less information." For this reason the estimation of long-memory processes
and processes with infinite variance is extremely difficult.
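The two expressions (2.4.23) and (2.4.25) for the variance of the sample mean can be compared numerically. The sketch below is an illustration with assumed inputs: it uses the covariance sequence of a first-order autoregressive process, $\gamma_x(l) = \sigma_x^2 a^{|l|}$ with a = 0.9 and unit-variance driving noise, and N = 100. The function names are hypothetical.

```python
def var_sample_mean(gamma, n):
    """Variance of the sample mean from (2.4.23).
    gamma(l) returns the covariance at lag l; the terms at |l| = n vanish."""
    return sum((1 - abs(l) / n) * gamma(l) for l in range(-n, n + 1)) / n

def delta_n(rho, n):
    """Correlation factor Delta_N of (2.4.26); rho(l) is the normalized covariance."""
    return 2 * sum((1 - l / n) * rho(l) for l in range(1, n + 1))

# Assumed example: AR(1)-type covariance with unit-variance white noise input.
a, n = 0.9, 100
sigma_x2 = 1.0 / (1.0 - a * a)

def gamma_ar1(l):
    return sigma_x2 * a ** abs(l)

def rho_ar1(l):
    return a ** abs(l)

def gamma_white(l):
    return sigma_x2 if l == 0 else 0.0

v_direct = var_sample_mean(gamma_ar1, n)                    # (2.4.23)
v_factored = sigma_x2 / n * (1 + delta_n(rho_ar1, n))       # (2.4.25)
v_white = var_sample_mean(gamma_white, n)                   # reduces to (2.4.22)

print(v_direct, v_factored, v_white)
```

With these assumed values the correlated case gives a variance of about 0.905, roughly 17 times the white-noise value $\sigma_x^2/N \approx 0.053$, which is the loss of "information per sample" described above.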

Sampling distribution. If we know the joint pdf of the random variables $\{x(n)\}_0^{N-1}$, we can determine, at least
in principle, the pdf of $\hat{\mu}_x$. For example, if it is assumed that the observations are IID as $N(\mu_x, \sigma_x^2)$, then from
(2.4.21) and (2.4.23), it can be seen that $\hat{\mu}_x$ is normal with mean $\mu_x$ and variance $\sigma_x^2/N$, that is,

$$f_{\hat{\mu}_x}(\hat{\mu}_x) = \frac{1}{\sqrt{2\pi}\,(\sigma_x/\sqrt{N})} \exp\left[ -\frac{1}{2}\left( \frac{\hat{\mu}_x - \mu_x}{\sigma_x/\sqrt{N}} \right)^2 \right] \tag{2.4.27}$$

which is the sampling distribution of the mean. If $N$ is large, then from the central limit theorem, the sampling
distribution of the sample mean (2.4.27) is usually very close to the normal distribution, even if the individual
distributions are not normal.

If we know the standard deviation $\sigma_x$, we can compute the probability

$$\Pr\left\{ \mu_x - k\frac{\sigma_x}{\sqrt{N}} < \hat{\mu}_x < \mu_x + k\frac{\sigma_x}{\sqrt{N}} \right\} = 1 - \alpha \tag{2.4.28}$$

that the random variable $\hat{\mu}_x$ is within a certain interval specified by two fixed quantities. A simple rearrangement of
the above inequality leads to

$$\Pr\left\{ \hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}} \right\} = 1 - \alpha \tag{2.4.29}$$

which gives the probability that the fixed quantity $\mu_x$ lies between the two random variables $\hat{\mu}_x - k\sigma_x/\sqrt{N}$ and
$\hat{\mu}_x + k\sigma_x/\sqrt{N}$. Hence (2.4.29) provides the probability that an interval with fixed length $2k\sigma_x/\sqrt{N}$ and randomly
centered at the estimated mean includes the true mean. If we choose $k$ so that the probability defined by (2.4.29) is
equal to 0.95, the interval is known as the 95 percent confidence interval. To understand the meaning of this
reasoning, we stress that for each set of measurements we compute a confidence interval that either contains or does
not contain the true mean. However, if we repeat this process for a large number of observation sets, about 95 percent
of the obtained confidence intervals will include the true mean. We stress that by no means does this imply that a
confidence interval includes the true mean with probability 0.95.
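The frequency interpretation of the confidence interval can be checked by simulation. The following sketch is an illustration under assumed values (true mean 3, known standard deviation 2, N = 50, k = 1.96, 4000 repetitions); it is not a procedure given in the text, and the function name is hypothetical.

```python
import math
import random

def coverage(mu, sigma, n_obs, n_trials, k, rng):
    """Fraction of trials in which the interval of (2.4.29),
    centered at the sample mean, contains the true mean mu."""
    half_width = k * sigma / math.sqrt(n_obs)
    hits = 0
    for _ in range(n_trials):
        mu_hat = sum(rng.gauss(mu, sigma) for _ in range(n_obs)) / n_obs
        # The interval is random; the parameter mu is fixed.
        if mu_hat - half_width < mu < mu_hat + half_width:
            hits += 1
    return hits / n_trials

rate = coverage(mu=3.0, sigma=2.0, n_obs=50, n_trials=4000,
                k=1.96, rng=random.Random(42))
print("fraction of intervals covering the true mean:", rate)
```

Each individual interval either contains the true mean or does not; it is the long-run fraction over many data sets that should be close to 0.95.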

If the variance $\sigma_x^2$ is unknown, then it has to be determined from the observations. This results in two
modifications of (2.4.29). First, $\sigma_x$ is replaced by $\hat{\sigma}_x$, obtained from

$$\hat{\sigma}_x^2 = \frac{1}{N-1}\sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{2.4.30}$$

which implies that the center and the length of the confidence interval are different for each set of observations.
Second, the random variable $(\hat{\mu}_x - \mu_x)/(\hat{\sigma}_x/\sqrt{N})$ is distributed according to Student's t distribution with $\nu = N - 1$
degrees of freedom (Parzen 1960), which tends to a Gaussian for large values of $N$. In these cases, the factor $k$ in
(2.4.29) is replaced by the appropriate value $t$ of Student's distribution, using $N - 1$ degrees of freedom, for the
desired level of confidence.

If the observations are normal but not IID, then from (2.4.25), the mean estimator $\hat{\mu}_x$ is normal with mean $\mu_x$
and variance $(\sigma_x^2/N)[1 + \Delta_N(\rho_x)]$. It is now easy to construct exact confidence intervals for $\mu_x$ if $\rho_x(l)$ is
known, and approximate confidence intervals if $\rho_x(l)$ is to be estimated from the observations. For large $N$, the
variance $\text{var}(\hat{\mu}_x)$ can be approximated by

$$\text{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N}\,[1 + \Delta_N(\rho_x)] \simeq \frac{\sigma_x^2}{N}\left[ 1 + 2\sum_{l=1}^{\infty}\rho_x(l) \right] \triangleq \frac{v}{N} \qquad v = \sigma_x^2\left[ 1 + 2\sum_{l=1}^{\infty}\rho_x(l) \right] \tag{2.4.31}$$

and hence an approximate 95 percent confidence interval for $\mu_x$ is given by

$$\left( \hat{\mu}_x - 1.96\sqrt{v/N},\ \hat{\mu}_x + 1.96\sqrt{v/N} \right) \tag{2.4.32}$$

This means that, on average, the above interval will enclose the true value $\mu_x$ on 95 percent of occasions. For many
practical random processes (especially those modeled as ARMA processes), the result in (2.4.32) is a good
approximation.


EXAMPLE 2.4.1. Consider the AR(1) process

$$x(n) = a\,x(n-1) + w(n) \qquad -1 < a < 1$$

where $w(n) \sim WN(0, \sigma_w^2)$. We wish to compute the variance of the mean estimator $\hat{\mu}_x$ of the process $x(n)$. Using
straightforward calculations, we obtain

$$\mu_x = 0 \qquad \sigma_x^2 = \frac{\sigma_w^2}{1 - a^2} \qquad\text{and}\qquad \rho_x(l) = a^{|l|}$$

Substituting $\rho_x(l)$ in (2.4.26) gives

$$\Delta_N(\rho_x) = \frac{2a}{1-a}\left[ 1 - \frac{1}{N(1-a)} + \frac{a^N}{N(1-a)} \right] \simeq \frac{2a}{1-a} \qquad\text{for } N \gg 1$$

When $a \to 1$, that is, when the dependence between the signal samples increases, then the factor $\Delta_N(\rho_x)$ takes large values and
the quality of the estimator decreases drastically. Similar conclusions can be drawn using the approximation (2.4.31)

$$v = \sigma_x^2\left[ 1 + 2\sum_{l=1}^{\infty} a^l \right] = \frac{\sigma_w^2}{1 - a^2}\,\frac{1+a}{1-a} = \frac{\sigma_w^2}{(1-a)^2}$$

We will next verify these results using two Monte Carlo simulations: one for $a = 0.9$, which represents high correlations among
samples, and the other for $a = 0.1$. Using a Gaussian pseudorandom number generator with mean 0 and variance $\sigma_w^2 = 1$, we
generated $N = 100$ samples of the AR(1) process $x(n)$. Using $v$ in (2.4.31) and (2.4.32), we next computed the confidence
intervals. For $a = 0.9$, we obtain

$$v = 100 \quad\text{and confidence interval: } (\hat{\mu}_x - 1.96,\ \hat{\mu}_x + 1.96)$$

and for $a = 0.1$, we obtain

$$v = 1.2345 \quad\text{and confidence interval: } (\hat{\mu}_x - 0.2178,\ \hat{\mu}_x + 0.2178)$$

Clearly, when the dependence between signal samples increases, the quality of the estimator decreases drastically and hence the
confidence interval is wider. To have the same confidence interval, we should increase the number of samples $N$.

We next estimate the mean, using (2.4.20), and we repeat the experiment 10,000 times. Figure 2.8 shows histograms of the
computed means for $a = 0.9$ and $a = 0.1$. The confidence intervals are also shown as dotted lines around the true mean. The
histograms are approximately Gaussian in shape. The histogram for the high-correlation case is wider than that for the
low-correlation case, which is to be expected. The 95 percent confidence intervals also indicate that very few estimates are outside
the interval.
FIGURE 2.8
Histograms of mean estimates in Example 2.4.1. (Top panel: high correlation, a = 0.9. Bottom panel: low correlation, a = 0.1. Horizontal axis: estimated mean, from -4 to 4. The 95% confidence interval is marked in each panel.)
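The numbers quoted in Example 2.4.1 follow directly from (2.4.31) and (2.4.32), and the Monte Carlo experiment can be reproduced on a smaller scale. The sketch below is one possible reconstruction, not the authors' code: it assumes unit-variance Gaussian driving noise, draws the first sample from the stationary distribution to avoid a start-up transient, and uses 2000 repetitions instead of 10,000. All function names are hypothetical.

```python
import math
import random
import statistics

def approx_v(a, sigma_w2=1.0):
    """v of (2.4.31) for an AR(1) process: sigma_x^2 (1 + a)/(1 - a)."""
    sigma_x2 = sigma_w2 / (1.0 - a * a)
    return sigma_x2 * (1.0 + a) / (1.0 - a)

def ci_half_width(a, n, z=1.96):
    """Half-width of the approximate 95 percent interval in (2.4.32)."""
    return z * math.sqrt(approx_v(a) / n)

def ar1_sample_mean(a, n, rng):
    """Sample mean of n samples of x(n) = a x(n-1) + w(n), w(n) ~ N(0, 1)."""
    # First sample drawn from the stationary distribution N(0, 1/(1 - a^2)).
    x = rng.gauss(0.0, math.sqrt(1.0 / (1.0 - a * a)))
    total = x
    for _ in range(n - 1):
        x = a * x + rng.gauss(0.0, 1.0)
        total += x
    return total / n

rng = random.Random(2024)
n, trials = 100, 2000
results = {}
for a in (0.9, 0.1):
    means = [ar1_sample_mean(a, n, rng) for _ in range(trials)]
    half = ci_half_width(a, n)
    # The true mean is 0, so the interval covers it when |mean| < half.
    inside = sum(abs(m) < half for m in means) / trials
    results[a] = (approx_v(a), half, statistics.pvariance(means), inside)
    print(a, results[a])
```

For a = 0.9 the empirical variance of the mean estimates should be close to the exact value of about 0.905 from (2.4.25), slightly below the large-N approximation v/N = 1, so the approximate interval is mildly conservative in that case.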


2.4.3 Estimation of Variance 


The natural estimator of the variance $\sigma_x^2$ of a stationary sequence $x(n)$ from the observations $\{x(n)\}_0^{N-1}$ is the
sample variance, given by

$$\hat{\sigma}_x^2 \triangleq \frac{1}{N}\sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{2.4.33}$$

By using the mean estimate $\hat{\mu}_x$ from (2.4.20), the mean of the variance estimator can be shown to equal (see
Problem 2.31)

$$E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \text{var}(\hat{\mu}_x) = \sigma_x^2 - \frac{1}{N}\sum_{l=-N}^{N}\left(1 - \frac{|l|}{N}\right)\gamma_x(l) \tag{2.4.34}$$

If the sequence $x(n)$ is uncorrelated, then

$$E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \frac{\sigma_x^2}{N} = \left( \frac{N-1}{N} \right)\sigma_x^2 \tag{2.4.35}$$

From (2.4.34) or (2.4.35), it is obvious that the estimator in (2.4.33) is biased. If $\gamma_x(l) \to 0$ as $l \to \infty$, then
$\text{var}(\hat{\mu}_x) \to 0$ as $N \to \infty$ and hence $\hat{\sigma}_x^2$ is an asymptotically unbiased estimator of $\sigma_x^2$. In practical applications,
the variance estimate is nearly unbiased for large $N$. Note that if we use the actual mean $\mu_x$ in (2.4.33), then the
resulting estimator is unbiased.
The general expression for the variance of the variance estimator is fairly complicated and requires higher-order
moments. It can be shown that for either estimator

$$\text{var}(\hat{\sigma}_x^2) \simeq \frac{\gamma_x^{(4)}}{N} \qquad\text{for large } N \tag{2.4.36}$$

where $\gamma_x^{(4)}$ is the fourth central moment of $x(n)$ (Brockwell and Davis 1991). Thus the estimator in (2.4.33) is
also consistent.


Sampling distribution. In the case of the mean estimator, the sampling distribution involved the distribution of
sums of random variables. The variance estimator involves the sum of the squares of random variables, for which the
sampling distribution computation is complicated. For example, if there are $N$ independent measurements from an
$N(0,1)$ distribution, then the sampling distribution of the random variable

$$\chi_N^2 = x_1^2 + x_2^2 + \cdots + x_N^2 \tag{2.4.37}$$

is given by the chi-squared distribution with $N$ degrees of freedom. The general form of $\chi_\nu^2$ with $\nu$ degrees of
freedom is

$$f_{\chi_\nu^2}(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{(\nu/2)-1} \exp\left( -\frac{x}{2} \right) \qquad 0 \leq x < \infty \tag{2.4.38}$$

where $\Gamma(\nu/2) = \int_0^{\infty} e^{-t}\, t^{(\nu/2)-1}\, dt$ is the gamma function with argument $\nu/2$.

For the variance estimator in (2.4.33), it can be shown (Parzen 1960) that $N\hat{\sigma}_x^2/\sigma_x^2$ is distributed as chi squared
with $\nu = N - 1$ degrees of freedom. This means that, for any set of $N$ observations, there will only be $N - 1$
independent deviations $\{x(n) - \hat{\mu}_x\}$, since their sum is zero from the definition of the mean. Assuming that the
observations are $N(\mu_x, \sigma_x^2)$, the random variables $x(n)/\sigma_x$ will be $N(\mu_x/\sigma_x, 1)$ and hence the random variable

$$\frac{N\hat{\sigma}_x^2}{\sigma_x^2} = \frac{1}{\sigma_x^2}\sum_{n=0}^{N-1} [x(n) - \hat{\mu}_x]^2 \tag{2.4.39}$$

will be chi squared distributed with $\nu = N - 1$. Therefore, using values of the chi-squared distribution, confidence
intervals for the variance estimator can be computed. In particular, since $N\hat{\sigma}_x^2/\sigma_x^2$ is distributed as $\chi_\nu^2$, the 95
percent limits of the form

$$\Pr\left\{ \chi_\nu^2\left(\frac{\alpha}{2}\right) < \frac{N\hat{\sigma}_x^2}{\sigma_x^2} < \chi_\nu^2\left(1 - \frac{\alpha}{2}\right) \right\} = 0.95 \qquad \alpha = 0.05 \tag{2.4.40}$$

can be obtained from chi-squared tables (Fisher and Yates 1938). By rearranging (2.4.40), the random variable
$\sigma_x^2/\hat{\sigma}_x^2$ satisfies

$$\Pr\left\{ \frac{N}{\chi_\nu^2(0.975)} < \frac{\sigma_x^2}{\hat{\sigma}_x^2} < \frac{N}{\chi_\nu^2(0.025)} \right\} = 0.95 \tag{2.4.41}$$

Using $l_1 = N/\chi_\nu^2(0.975)$ and $l_2 = N/\chi_\nu^2(0.025)$, we see that (2.4.41) implies that

$$\Pr\{ l_2\hat{\sigma}_x^2 \geq \sigma_x^2 \ \text{ and } \ l_1\hat{\sigma}_x^2 \leq \sigma_x^2 \} = 0.95 \tag{2.4.42}$$

Thus the 95 percent confidence interval based on the estimate $\hat{\sigma}_x^2$ is $(l_1\hat{\sigma}_x^2,\ l_2\hat{\sigma}_x^2)$. Note that this interval is sensitive
to the validity of the normal assumption of random variables leading to (2.4.39). This is not the case for the
confidence intervals for the mean estimates because, thanks to the central limit theorem, the computation of the
interval can be based on the normal assumption.
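The factors $l_1$ and $l_2$ require chi-squared quantiles, which the text takes from tables. As a rough stand-in for such tables, the sketch below uses the Wilson-Hilferty cube-root approximation to the chi-squared quantile. That approximation is a swapped-in technique chosen here for illustration, not the method of the text, and its accuracy degrades for small degrees of freedom. The function names are hypothetical.

```python
from statistics import NormalDist

def chi2_quantile_wh(p, nu):
    """Approximate p-quantile of the chi-squared distribution with nu degrees
    of freedom (Wilson-Hilferty approximation; not an exact value)."""
    z = NormalDist().inv_cdf(p)       # standard normal quantile
    c = 2.0 / (9.0 * nu)
    return nu * (1.0 - c + z * c ** 0.5) ** 3

def variance_ci_factors(n, level=0.95):
    """Factors (l1, l2) such that (l1 * s2, l2 * s2) is the confidence
    interval of (2.4.42) for the variance, given the estimate s2 of (2.4.33)."""
    alpha = 1.0 - level
    nu = n - 1
    l1 = n / chi2_quantile_wh(1.0 - alpha / 2.0, nu)
    l2 = n / chi2_quantile_wh(alpha / 2.0, nu)
    return l1, l2

l1, l2 = variance_ci_factors(100)
print("l1 =", round(l1, 4), "l2 =", round(l2, 4))
```

For N = 100 this gives factors of roughly 0.78 and 1.36, in line with values that would be read from chi-squared curves or tables for 99 degrees of freedom.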


EXAMPLE 2.4.2. Consider again the AR(1) process given in Example 2.4.1:

$$x(n) = a\,x(n-1) + w(n) \qquad -1 < a < 1 \qquad w(n) \sim WN(0,1)$$

with

$$\mu_x = 0 \qquad \sigma_x^2 = \frac{1}{1 - a^2} \qquad\text{and}\qquad \rho_x(l) = a^{|l|} \tag{2.4.43}$$

We wish to compute the mean of the variance estimator $\hat{\sigma}_x^2$ of the process $x(n)$. From (2.4.34), we obtain

$$E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \frac{\sigma_x^2}{N}\sum_{l=-N}^{N}\left(1 - \frac{|l|}{N}\right) a^{|l|} \tag{2.4.44}$$

When $a \to 1$, that is, when the dependence between the signal samples increases, the mean of the estimate deviates significantly
from the true value $\sigma_x^2$ and the quality of the estimator decreases drastically. For small dependence, the mean is very close to $\sigma_x^2$.

These conclusions can be verified using two Monte Carlo simulations as before: one for $a = 0.9$, which represents high
correlations among samples, and the other for $a = 0.1$. Using a Gaussian pseudorandom number generator with mean 0 and unit
variance, we generated $N = 100$ samples of the AR(1) process $x(n)$. The computed parameters according to (2.4.43) and (2.4.44)
are

a = 0.9:   $\sigma_x^2 = 5.2632$   $E\{\hat{\sigma}_x^2\} = 4.3579$
a = 0.1:   $\sigma_x^2 = 1.0101$   $E\{\hat{\sigma}_x^2\} = 0.9978$

We next estimate the variance by using (2.4.33) and repeat the experiment 10,000 times. Figure 2.9 shows histograms of computed
variances for $a = 0.9$ and for $a = 0.1$. The computed means of the variance estimates are also shown as dotted lines. Clearly, the
histogram is much wider for the high-correlation case and much narrower (almost symmetric and Gaussian) for the low-correlation
case.
FIGURE 2.9
Histograms of variance estimates in Example 2.4.2. (Top panel: high correlation, a = 0.9. Bottom panel: low correlation, a = 0.1. Horizontal axis: estimated variance. The mean of the variance estimates and the 95% confidence interval are marked.)

The 95 percent confidence intervals are given by $(l_1\hat{\sigma}_x^2,\ l_2\hat{\sigma}_x^2)$, where $l_1 = N/\chi_\nu^2(0.975)$ and $l_2 = N/\chi_\nu^2(0.025)$. The values
of $l_1$ and $l_2$ are obtained from the chi-squared distribution curves (Jenkins and Watts 1968). For $N = 100$, $l_1 = 0.77$ and
$l_2 = 1.35$; hence the 95 percent confidence intervals for $\sigma_x^2$ are

$$(0.77\hat{\sigma}_x^2,\ 1.35\hat{\sigma}_x^2)$$

also shown as dashed lines around the mean value $E\{\hat{\sigma}_x^2\}$. The confidence interval for the high-correlation case, $a = 0.9$, does not
appear to be a good interval, which implies that the approximation leading to (2.4.42) is not a good one for this case. Such is not the
case for $a = 0.1$.
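The tabulated values of Example 2.4.2 can be reproduced by evaluating (2.4.43) and (2.4.44) directly. The short sketch below does only that arithmetic; the function name is a hypothetical choice, and unit-variance driving noise is assumed as in the example.

```python
def expected_sample_variance_ar1(a, n, sigma_w2=1.0):
    """Return (sigma_x^2, E{sample variance}) for an AR(1) process,
    using (2.4.43) for the true variance and (2.4.44) for the mean
    of the divide-by-N variance estimator."""
    sigma_x2 = sigma_w2 / (1.0 - a * a)
    # var(mean estimator) = (sigma_x^2 / N) * sum of (1 - |l|/N) a^|l|
    var_mean = sigma_x2 / n * sum((1 - abs(l) / n) * a ** abs(l)
                                  for l in range(-n, n + 1))
    return sigma_x2, sigma_x2 - var_mean

true_hi, mean_hi = expected_sample_variance_ar1(0.9, 100)
true_lo, mean_lo = expected_sample_variance_ar1(0.1, 100)
print("a=0.9:", round(true_hi, 4), round(mean_hi, 4))
print("a=0.1:", round(true_lo, 4), round(mean_lo, 4))
```

The relative bias is about 17 percent for a = 0.9 but only about 1 percent for a = 0.1, which quantifies how strongly sample correlation degrades the variance estimate at a fixed N.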


2.5 Summary 


In this chapter we provided an overview of the basic theory of discrete-time stochastic processes. We began with the 
description of stochastic processes. To describe stochastic processes, we proceeded to define mean and
autocorrelation sequences. In many applications, the concept of stationarity of random processes is a useful one that
reduces the computational complexity. Assuming time invariance on the first two moments, we defined a wide-sense
stationary (WSS) process in which the mean is a constant and correlation between random variables at two distinct 
times is a function of time difference or lag. The rest of the chapter was devoted to the analysis of WSS processes. 

A stochastic process is generally observed in practice as a single sample function (a speech signal or a radar 
signal) from which it is necessary to estimate the first- and the second-order moments. This requires the notion of 
ergodicity, which provides a framework for the computation of statistical averages using time averages over a single 
realization. Although this framework requires theoretical results using mean square convergence, we provided a 
simple approach of using appropriate time averages. An important random signal characteristic called variability was 
introduced. The WSS processes were then described in the frequency domain using the power spectral density 
function, which is a physical quantity that can be measured in practice. Some random processes exhibiting flat 
spectral envelopes were analyzed including one of white noise. Since random processes are generally processed using 
linear systems, we described linear system operations with random inputs in both the time and frequency domains. 

The properties of correlation matrices and sequences play an important role in filtering and estimation theory 
and were discussed in detail, including eigenanalysis. Another important random signal characteristic called memory 
was also introduced. Stationary random signals were modeled using autocorrelation matrices, and the relationship 
between spectral flatness and eigenvalue spread was explored. These properties were used in an alternate 
representation of random vectors as well as processes using uncorrelated components which were based on 
diagonalization and triangularization of correlation matrices. 

Finally, we concluded this chapter with the introduction of elementary estimation theory. After discussion of 
properties of estimators, two important estimators of mean and variance were treated in detail along with their 
sampling distributions. These topics will be useful in many subsequent chapters. 


Problems 


2.1 The exponential density function is given by

$$f_x(x) = \frac{1}{a}\, e^{-x/a}\, u(x) \tag{P.1}$$

where $a$ is a parameter and $u(x)$ is a unit step function.
(a) Plot the density function for $a = 1$.
(b) Determine the mean, variance, skewness, and kurtosis of the exponential random variable with $a = 1$. Comment on the
significance of these moments in terms of the shape of the density function.
(c) Determine the characteristic function of the exponential pdf.
2.2 The Rayleigh density function is given by

$$f_x(x) = \frac{x}{\sigma^2}\, e^{-x^2/(2\sigma^2)}\, u(x) \tag{P.2}$$

where $\sigma$ is a parameter and $u(x)$ is a unit step function. Repeat Problem 2.1 for $\sigma = 1$.
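2.3 Using the binomial expansion of $\{x(\zeta) - \mu_x\}^m$, show that the $m$th central moment is given by

$$\gamma_x^{(m)} = \sum_{k=0}^{m} \binom{m}{k} (-1)^k \mu_x^k\, r_x^{(m-k)}$$

Similarly, show that

$$r_x^{(m)} = \sum_{k=0}^{m} \binom{m}{k} \mu_x^k\, \gamma_x^{(m-k)}$$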


2.4 Consider a zero-mean random variable $x(\zeta)$. The cumulant generating function is given by

$$\bar{\Psi}_x(s) \triangleq \ln \bar{\Phi}_x(s) = \ln E\{e^{s x(\zeta)}\} \tag{P.3}$$

When $s$ is replaced by $j\xi$ in (P.3), the resulting function is known as the second characteristic function and is denoted by
$\Psi_x(\xi)$.
The cumulants $\kappa_x^{(m)}$ of a random variable $x(\zeta)$ are defined as the derivatives of the cumulant generating function, that is,

$$\kappa_x^{(m)} \triangleq \left. \frac{d^m[\bar{\Psi}_x(s)]}{ds^m} \right|_{s=0} = (-j)^m \left. \frac{d^m[\Psi_x(\xi)]}{d\xi^m} \right|_{\xi=0} \qquad m = 1, 2, \ldots$$

Clearly, $\kappa_x^{(0)} = 0$. It can be shown that for a zero-mean random variable, the first five cumulants as functions of the central
moments are given by

$$\kappa_x^{(1)} = \mu_x = 0$$
$$\kappa_x^{(2)} = \gamma_x^{(2)} = \sigma_x^2$$
$$\kappa_x^{(3)} = \gamma_x^{(3)}$$
$$\kappa_x^{(4)} = \gamma_x^{(4)} - 3\sigma_x^4$$

Using (P.3), show that the first four cumulants of $x(\zeta)$ are given by the expressions above.
2.5 A random vector $\mathbf{x}(\zeta) = [x_1(\zeta)\ x_2(\zeta)]^T$ has mean vector $\boldsymbol{\mu}_x = [1\ 2]^T$ and covariance matrix

$$\boldsymbol{\Gamma}_x = \begin{bmatrix} 4 & 0.8 \\ 0.8 & 1 \end{bmatrix}$$

This vector is transformed to another random vector $\mathbf{y}(\zeta) = [y_1(\zeta)\ y_2(\zeta)\ y_3(\zeta)]^T$ by the following linear transformation:

y(S)| [1 3
y($)|=|-1 2 ee
y3(9) : or

Determine (a) the mean vector $\boldsymbol{\mu}_y$, (b) the autocovariance matrix $\boldsymbol{\Gamma}_y$, and (c) the cross-correlation matrix $\mathbf{R}_{xy}$.
2.6 Using the moment generating function, show that the linear transformation of a Gaussian random vector is also Gaussian.
2.7 Let $\{x_k(\zeta)\}_{k=1}^{4}$ be four IID random variables with exponential distribution (P.1) with $a = 1$. Let

$$y_k(\zeta) = \sum_{l=1}^{k} x_l(\zeta) \qquad 1 \leq k \leq 4$$

(a) Determine and plot the pdf of $y_2(\zeta)$.
(b) Determine and plot the pdf of $y_3(\zeta)$.
(c) Determine and plot the pdf of $y_4(\zeta)$.
(d) Compare the pdf of $y_4(\zeta)$ with that of the Gaussian density.
2.8 For each of the following, determine whether the random process is (1) WSS or (2) m.s. ergodic in the mean.
(a) $X(t) = A$, where $A$ is a random variable uniformly distributed between 0 and 1.
(b) $X_n = A\cos\omega_0 n$, where $A$ is a Gaussian random variable with mean 0 and variance 1.
(c) A Bernoulli process with $\Pr[X_n = 1] = p$ and $\Pr[X_n = -1] = 1 - p$.
2.9 Consider the harmonic process $x(n)$ defined in (2.1.50).
(a) Determine the mean of $x(n)$.
(b) Show that the autocorrelation sequence is given by

N
r= Diba, P cosol —0o< [<0
k=l

2.10 Suppose that the random variables $\phi_k$ in the real-valued harmonic process model are distributed with a pdf
$f_{\phi_k}(\phi_k) = (1 + \cos\phi_k)/(2\pi)$, $-\pi \leq \phi_k \leq \pi$. Is the resulting stochastic process stationary?
2.11 A stationary random sequence $x(n)$ with mean $\mu_x = 4$ and autocovariance

$$\gamma_x(n) = \begin{cases} 4 - |n| & |n| \leq 3 \\ 0 & \text{otherwise} \end{cases}$$

is applied as an input to a linear shift-invariant (LSI) system whose impulse response $h(n)$ is

$$h(n) = u(n) - u(n-4)$$

where $u(n)$ is a unit step sequence. The output of this system is another random sequence $y(n)$. Determine (a) the mean
sequence $\mu_y(n)$, (b) the cross-covariance $\gamma_{xy}(n_1, n_2)$ between $x(n_1)$ and $y(n_2)$, and (c) the autocovariance $\gamma_y(n_1, n_2)$ of
the output process $y(n)$.
2.12 A causal LTI system, which is described by the difference equation

yin) =5 y(n) + x(n) += x(n-

is driven by a zero-mean WSS process with autocorrelation $r_x(l) = 0.5^{|l|}$.
(a) Determine the PSD and the autocorrelation of the output sequence $y(n)$.
(b) Determine the cross-correlation $r_{xy}(l)$ and cross-PSD $R_{xy}(e^{j\omega})$ between the input and output signals.
2.13 A WSS process with PSD $R_x(e^{j\omega}) = 1/(1.64 + 1.6\cos\omega)$ is applied to a causal system described by the following difference
equation

$$y(n) = 0.6\,y(n-1) + x(n) + 1.25\,x(n-1)$$

Compute (a) the PSD of the output and (b) the cross-PSD $R_{xy}(e^{j\omega})$ between input and output.
2.14 Determine whether the following matrices are valid correlation matrices:

, if
2 4
(a)R I i (b)R 1,1!
a = =|— —
= ai = 2 2
ii,
4 2
I&i
2
j 1 1
R, = d)R,=|— 2 —
OR=|,; v (OR, 2 2
t £i

2.15 Consider a normal random vector $\mathbf{x}(\zeta)$ with components that are mutually uncorrelated, that is, $\rho_{ij} = 0$. Show that (a) the
covariance matrix $\boldsymbol{\Gamma}_x$ is diagonal and (b) the components of $\mathbf{x}(\zeta)$ are mutually independent.
2.16 Show that if a real, symmetric, and nonnegative definite matrix $\mathbf{R}$ has eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$, then the matrix $\mathbf{R}^k$ has
eigenvalues $\lambda_1^k, \lambda_2^k, \ldots, \lambda_M^k$.
2.17 Prove that the trace of $\mathbf{R}$ is given by

$$\text{tr}\,\mathbf{R} = \sum_i \lambda_i$$

2.18 Prove that the determinant of $\mathbf{R}$ is given by

$$\det \mathbf{R} = |\mathbf{R}| = \prod_i \lambda_i$$

2.19 Show that the determinants of $\mathbf{R}$ and $\boldsymbol{\Gamma}$ are related by

$$\det \mathbf{R} = \det \boldsymbol{\Gamma}\,(1 + \boldsymbol{\mu}^H \boldsymbol{\Gamma}^{-1} \boldsymbol{\mu})$$

2.20 Let $\mathbf{R}_x$ be the correlation matrix of the vector $\mathbf{x} = [x(0)\ x(2)\ x(3)]^T$, where $x(n)$ is a zero-mean WSS process.
(a) Check whether the matrix $\mathbf{R}_x$ is Hermitian, Toeplitz, and nonnegative definite.
(b) If we know the matrix $\mathbf{R}_x$, can we determine the correlation matrix of the vector $\bar{\mathbf{x}} = [x(0)\ x(1)\ x(2)\ x(3)]^T$?
2.21 Using the nonnegativeness of $E\{[x(n+l) \pm x(n)]^2\}$, show that $r_x(0) \geq |r_x(l)|$ for all $l$.
2.22 Show that $r_x(l)$ is nonnegative definite, that is,

$$\sum_{l=1}^{M}\sum_{k=1}^{M} a_l\, r_x(l-k)\, a_k^* \geq 0 \qquad \forall M,\ \forall a_1, \ldots, a_M$$

2.23 Let $x(n)$ be a random process generated by the AP(1) system

$$x(n) = a\,x(n-1) + w(n) \qquad n \geq 0 \qquad x(-1) = 0$$

where $w(n)$ is an $\text{IID}(0, \sigma_w^2)$ process.
(a) Determine the autocorrelation $r_x(n_1, n_2)$ function.
(b) Show that $r_x(n_1, n_2)$ asymptotically approaches $r_x(n_1 - n_2)$, that is, it becomes shift-invariant.
2.24 Let $\mathbf{x}$ be a random vector with mean $\boldsymbol{\mu}_x$ and autocorrelation $\mathbf{R}_x$.
(a) Show that $\mathbf{y} = \mathbf{Q}^H\mathbf{x}$ transforms $\mathbf{x}$ to an uncorrelated component vector $\mathbf{y}$ if $\mathbf{Q}$ is the eigenmatrix of $\mathbf{R}_x$.
(b) Comment on the geometric interpretation of this transformation.
2.25 The mean and the covariance of a Gaussian random vector $\mathbf{x}$ are given by, respectively,

1
1 Ls
n=l; and JY=
1

2.26 


2.27 


2.28 


2.29 


2.30 
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Plot the 10,20, and 3g concentration ellipses representing the contours of the density function in the (x,,x,) plane. Hints: 
The radius of an ellipse with major axis a (along xı) and minor axis b<a (along x,) is given by 
2 ab 
a’ sin? +b’ cos’ 0 
where 0<@<27. Compute the lo ellipse specified by a= JA and b=./A, and then rotate and translate each point 
x =[x(? xf] using the transformation w =Q,x + 4- 
Consider the process x(44) = ax( 4—1) + @( 4), where ap) ~ WN(0, 02) - 
(a) Show that the MXM correlation matrix of the process is symmetric Toeplitz and is given by 


r 





1 a siw gh 
R= o} a 1 mre 
* l-a 
a" qm 1 
(b) Verify that 
1 —a 0 0 
-a |+a -a 0 
R' =e 0 -a i : 
Po) : l+a -a 
0 0 —a 1 
(c) Show that if 
1 0 0 
|a i 0 
fe ee Hg 
0 0 -a 1 


then ITR.L, =(1-a°)I. 

(d) For o} =1,a=0.95,and M =8 compute the DKLT and the DFT. 

(e) Plot the eigenvalues of each transform in the same graph of the PSD of the process. Explain your findings. 
(f) Plot the eigenvectors of each transform and compare the results. 

(g) Repeat parts (e) and (f) for M =16 and M =32. Explain the obtained results. 

(h) Repeat parts (e) to (g) for a=0.5 and compare with the results obtained for a=0.95. 

Determine three different innovations representations of a zero-mean random vector x with correlation matrix 


el 


Verify that the eigenvalues and eigenvectors of the M XM correlation matrix of the process x(4) = @(4)+bæ(n-—1), where 
an) ~ WN(0,02) are given by A =R,(e), gs? =sin an, @ =7k/(M +1), where k =1,2,---,M , (a) analytically 
and (b) numerically for o, =] and M =8. Hint: Plot the eigenvalues on the same graph with the PSD. 
2.29 Consider the process $x(n) = w(n) + b\,w(n-1)$.
(a) Compute the DKLT for $M = 3$.
(b) Show that the variances of the DKLT coefficients are $\sigma_w^2(1 + \sqrt{2}\,b)$, $\sigma_w^2$, and $\sigma_w^2(1 - \sqrt{2}\,b)$.

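2.30 Let $x(n)$ be a stationary random process with mean $\mu_x$ and covariance $\gamma_x(l)$. Let $\hat{\mu}_x = \frac{1}{N} \sum_{n=0}^{N-1} x(n)$ be the sample mean from the observations $\{x(n)\}_{0}^{N-1}$.
(a) Show that the variance of $\hat{\mu}_x$ is given by

$$\mathrm{var}(\hat{\mu}_x) = N^{-1} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l) \le N^{-1} \sum_{l=-N}^{N} |\gamma_x(l)| \tag{P.4}$$

(b) Show that the above result (P.4) can be expressed as

$$\mathrm{var}(\hat{\mu}_x) = \frac{\sigma_x^2}{N} \left[1 + \Delta_N(\rho_x)\right] \tag{P.5}$$

where

$$\Delta_N(\rho_x) = 2 \sum_{l=1}^{N} \left(1 - \frac{l}{N}\right) \rho_x(l) \qquad \rho_x(l) = \frac{\gamma_x(l)}{\sigma_x^2}$$

(c) Show that (P.4) reduces to $\mathrm{var}(\hat{\mu}_x) = \sigma_x^2/N$ for a $\mathrm{WN}(\mu_x, \sigma_x^2)$ process.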
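2.31 Let $x(n)$ be a stationary random process with mean $\mu_x$, variance $\sigma_x^2$, and covariance $\gamma_x(l)$. Let

$$\hat{\sigma}_x^2 \triangleq \frac{1}{N} \sum_{n=0}^{N-1} \left[x(n) - \hat{\mu}_x\right]^2$$

be the sample variance from the observations $\{x(n)\}_{0}^{N-1}$.
(a) Show that the mean of $\hat{\sigma}_x^2$ is given by

$$E\{\hat{\sigma}_x^2\} = \sigma_x^2 - \mathrm{var}(\hat{\mu}_x) = \sigma_x^2 - \frac{1}{N} \sum_{l=-N}^{N} \left(1 - \frac{|l|}{N}\right) \gamma_x(l)$$

(b) Show that the above result reduces to $E\{\hat{\sigma}_x^2\} = (N-1)\sigma_x^2/N$ for a $\mathrm{WN}(\mu_x, \sigma_x^2)$ process.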
2.32 The Cauchy distribution with mean $\mu$ is given by

$$f_x(x) = \frac{1}{\pi} \, \frac{1}{1 + (x - \mu)^2} \qquad -\infty < x < \infty$$

Let $\{x_k(\zeta)\}_{k=1}^{N}$ be $N$ IID random variables with the above distribution. Consider the mean estimator based on $\{x_k(\zeta)\}_{k=1}^{N}$

$$\hat{\mu}(\zeta) = \frac{1}{N} \sum_{k=1}^{N} x_k(\zeta)$$

Determine whether $\hat{\mu}(\zeta)$ is a consistent estimator of $\mu$.


CHAPTER 3


Linear Signal Models 


In this chapter we introduce and analyze the properties of a special class of stationary random sequences that are 
obtained by driving a linear, time-invariant system with white noise. We focus on filters having a system function that 
is rational, that is, the ratio of two polynomials. The power spectral density of the resulting process is also rational, 
and its shape is completely determined by the filter coefficients. We will use the term pole-zero models when we 
want to emphasize the system viewpoint and the term autoregressive moving-average models to refer to the resulting 
random sequences. The latter term is not appropriate when the input is a harmonic process or a deterministic signal 
with a flat spectral envelope. We discuss the impulse response, autocorrelation, power spectrum, partial 
autocorrelation, and cepstrum of all-pole, all-zero, and pole-zero models. We express all these quantities in terms of 
the model coefficients and develop procedures to convert from one parameter set to another. Low-order models are 
studied in detail, because they are easy to analyze analytically and provide insight into the behavior and properties of 
higher-order models. An understanding of the correlation and spectral properties of a signal model is very important 
for the selection of the appropriate model in practical applications. 


3.1 Introduction 


In Chapter 2 we defined and studied random processes as a mathematical tool to analyze random signals. In practice, 
we also need to generate random signals that possess certain known, second-order characteristics, or we need to 
describe observed signals in terms of the parameters of known random processes. 

The simplest random signal model is the wide-sense stationary white noise sequence $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$ that
has uncorrelated samples and a flat PSD. It is also easy to generate in practice by using simple algorithms. If we filter
white noise with a stable LTI filter, we can obtain random signals with almost any arbitrary aperiodic correlation
structure or continuous PSD. If we wish to generate a random signal with a line PSD using the previous approach, we
need an LTI filter with "line" frequency response; that is, we need an oscillator. Unfortunately, such a system is not
stable, and its output cannot be stationary. Fortunately, random signals with line PSDs can be easily generated by
using the harmonic process model (linear combination of sinusoidal sequences with statistically independent random
phases) discussed in Section 2.1.6. Figure 3.1 illustrates the filtering of white noise and "white" (flat spectral
envelope) harmonic process by an LTI filter. Signal models with mixed PSDs can be obtained by combining the
above two models, a process justified by a powerful result known as the Wold decomposition.

FIGURE 3.1
Signal models with continuous and discrete (line) power spectrum densities.


When the LTI filter is specified by its impulse response, we have a nonparametric signal model because there is 
no restriction regarding the form of the model and the number of parameters is infinite. However, if we specify the
filter by a finite-order rational system function, we have a parametric signal model described by a finite number of
parameters. We focus on parametric models because they are simpler to deal with in practical applications. The two 
major topics we address in this chapter are (1) the derivation of the second-order moments of AP, AZ, and PZ models, 
given the coefficients of their system function, and (2) the design of an AP, AZ, or PZ system that produces a random 
signal with a given autocorrelation sequence or PSD function. The second problem is known as signal modeling and 
theoretically is equivalent to the spectral factorization procedure developed. The modeling of harmonic processes is 
theoretically straightforward and does not require the use of a linear filter to change the amplitude of the spectral lines. 
The challenging problem in this case is the identification of the filter by observing its response to a harmonic process 
with a flat PSD. The modeling problem for continuous PSDs has a solution, at least in principle, for every regular 
random sequence. 

In practical applications, the second-order moments of the signal to be modeled are not known a priori and have 
to be estimated from a set of signal observations. This element introduces a new dimension and additional 
complications to the signal modeling problem, which are discussed in Chapter 8. In this chapter we primarily focus 
on parametric models that replicate the second-order properties (autocorrelation or PSD) of stationary random 
sequences. If the sequence is Gaussian, the model provides a complete statistical characterization. 


3.1.1 Linear Nonparametric Signal Models 


Consider a stable LTI system with impulse response $h(n)$ and input $w(n)$. The output $x(n)$ is given by the
convolution summation

$$x(n) = \sum_{k=-\infty}^{\infty} h(k)\,w(n-k) \tag{3.1.1}$$

which is known as a nonrecursive system representation because the output is computed by linearly weighting
samples of the input signal.
Linear random signal model. If the input $w(n)$ is a zero-mean white noise process with variance $\sigma_w^2$,
autocorrelation $r_w(l) = \sigma_w^2 \delta(l)$, and PSD $R_w(e^{j\omega}) = \sigma_w^2$, $-\pi < \omega \le \pi$, then from Table 2.2 the autocorrelation,
complex PSD, and PSD of the output $x(n)$ are given by, respectively,

$$r_x(l) = \sigma_w^2 \sum_{k=-\infty}^{\infty} h(k)\,h^*(k-l) = \sigma_w^2\, r_h(l) \tag{3.1.2}$$

$$R_x(z) = \sigma_w^2\, H(z)\, H^*\!\left(\frac{1}{z^*}\right) \tag{3.1.3}$$

$$R_x(e^{j\omega}) = \sigma_w^2\, |H(e^{j\omega})|^2 = \sigma_w^2\, R_h(e^{j\omega}) \tag{3.1.4}$$

We notice that when the input is a white noise process, the shape of the autocorrelation and the power spectrum 
(second-order moments) of the output signal are completely characterized by the system. We use the term 
system-based signal model to refer to the signal generated by a system with a white noise input. If the system is linear, 
we use the term linear random signal model. In the statistical literature, the resulting model is known as the general 
linear process model. However, we should mention that in some applications it is more appropriate to use a 
deterministic input with flat spectral envelope or a “white” harmonic process input. 

Recursive representation. Suppose now that the inverse system $H_I(z) = 1/H(z)$ is causal and stable. If we
assume, without any loss of generality, that $h(0) = 1$, then $h_I(n) = \mathcal{Z}^{-1}\{H_I(z)\}$ has $h_I(0) = 1$. Therefore the
input $w(n)$ can be obtained by

$$w(n) = x(n) + \sum_{k=1}^{\infty} h_I(k)\,x(n-k) \tag{3.1.5}$$

Solving for $x(n)$, we obtain the following recursive representation for the output signal

$$x(n) = -\sum_{k=1}^{\infty} h_I(k)\,x(n-k) + w(n) \tag{3.1.6}$$


We use the term recursive representation to emphasize that the present value of the output is obtained by a linear 
combination of all past output values, plus the present value of the input. By construction the nonrecursive and 
recursive representations of system h(n) are equivalent; that is, they produce the same output when they are excited 
by the same input signal.
Innovations representation. If the system $H(z)$ is minimum-phase, then both $h(n)$ and $h_I(n)$ are causal
and stable. Hence, the output signal can be expressed nonrecursively by

$$x(n) = \sum_{k=0}^{\infty} h(k)\,w(n-k) = \sum_{k=-\infty}^{n} h(n-k)\,w(k) \tag{3.1.7}$$

or recursively by (3.1.6).
From (3.1.7) we obtain

$$x(n+1) = \sum_{k=-\infty}^{n} h(n+1-k)\,w(k) + w(n+1)$$

or by using (3.1.5)

$$x(n+1) = \underbrace{\sum_{k=-\infty}^{n} h(n+1-k) \sum_{j=-\infty}^{k} h_I(k-j)\,x(j)}_{\text{past information: linear combination of } x(n),\, x(n-1),\, \ldots} + \underbrace{w(n+1)}_{\text{new information}} \tag{3.1.8}$$

Careful inspection of (3.1.8) indicates that if the system generating $x(n)$ is minimum-phase, the sample $w(n+1)$
brings all the new information (innovation) to be carried by the sample $x(n+1)$. All other information can be
predicted from the past samples $x(n), x(n-1), \ldots$ of the signal (see Section 5.6). We stress that this interpretation
holds only if $H(z)$ is minimum-phase.


The system $H(z)$ generates the signal $x(n)$ by introducing dependence in the white noise input $w(n)$ and is
known as the synthesis or coloring filter. In contrast, the inverse system $H_I(z)$ can be used to recover the input
$w(n)$ and is known as the analysis or whitening filter. In this sense the innovations sequence and the output
process are completely equivalent. The synthesis and analysis filters are shown in Figure 3.2.

FIGURE 3.2
Synthesis and analysis filters used in innovations representation.
[Block diagram: $w(n) \to H(z) \to x(n)$ (synthesis or coloring filter); $x(n) \to H_I(z) = 1/H(z) \to w(n)$ (analysis or whitening filter).]
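The equivalence between the synthesis and analysis filters can be checked numerically. The following is a minimal sketch, not the book's own code: the helper `pz_filter` and the model $H(z) = (1 + 0.5z^{-1})/(1 - 0.8z^{-1})$ are illustrative assumptions, chosen so that both $H(z)$ and $1/H(z)$ are causal and stable.

```python
import random

def pz_filter(d, a, w):
    """Run x(n) = -sum_{k>=1} a[k] x(n-k) + sum_{k>=0} d[k] w(n-k), with a[0] = 1."""
    x = []
    for n in range(len(w)):
        acc = sum(d[k] * w[n - k] for k in range(len(d)) if n - k >= 0)
        acc -= sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0)
        x.append(acc)
    return x

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(500)]   # white noise input w(n)

# Hypothetical minimum-phase model H(z) = (1 + 0.5 z^-1) / (1 - 0.8 z^-1)
d, a = [1.0, 0.5], [1.0, -0.8]

x = pz_filter(d, a, w)        # synthesis (coloring) filter H(z)
w_rec = pz_filter(a, d, x)    # analysis (whitening) filter H_I(z) = 1 / H(z)

max_err = max(abs(u - v) for u, v in zip(w, w_rec))
print(max_err)                # tiny: the innovations are recovered up to rounding
```

Swapping the numerator and denominator coefficient lists implements the inverse system only because $h(0) = d_0 = 1$ and the zero of $H(z)$ lies inside the unit circle.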


Spectral factorization 


Most random processes with a continuous PSD $R_x(e^{j\omega})$ can be generated by exciting a minimum-phase
system $H_{\min}(z)$ with white noise. The PSD of the resulting process is given by

$$R_x(e^{j\omega}) = \sigma_w^2\, |H_{\min}(e^{j\omega})|^2 \tag{3.1.9}$$

The process of obtaining $H_{\min}(z)$ from $R_x(e^{j\omega})$ or $r_x(l)$ is known as spectral factorization.
If the PSD $R_x(e^{j\omega})$ satisfies the Paley-Wiener condition

$$\int_{-\pi}^{\pi} \left|\ln R_x(e^{j\omega})\right| d\omega < \infty \tag{3.1.10}$$

then the process $x(n)$ is called regular and its complex PSD can be factored as follows

$$R_x(z) = \sigma_w^2\, H_{\min}(z)\, H_{\min}^*\!\left(\frac{1}{z^*}\right) \tag{3.1.11}$$

where

$$\sigma_w^2 = \exp\left\{\frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, d\omega\right\} \tag{3.1.12}$$

is the variance of the white noise input and can be interpreted as the geometric mean of $R_x(e^{j\omega})$. Consider the
inverse Fourier transform of $\ln R_x(e^{j\omega})$:

$$c(k) \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, e^{jk\omega}\, d\omega \tag{3.1.13}$$

which is a sequence known as the cepstrum of $r_x(l)$. Note that $c(0) = \ln \sigma_w^2$, which follows by comparing (3.1.12)
with (3.1.13) at $k = 0$. Thus in the cepstral domain, the multiplicative factors $H_{\min}(z)$ and $H_{\min}^*(1/z^*)$ are now
additively separable due to the natural logarithm of $R_x(e^{j\omega})$. Define

$$c_+(k) \triangleq \frac{c(0)}{2}\,\delta(k) + c(k)\,u(k-1) \tag{3.1.14}$$

and

$$c_-(k) \triangleq \frac{c(0)}{2}\,\delta(k) + c(k)\,u(-k-1) \tag{3.1.15}$$

as the positive- and negative-axis projections of $c(k)$, respectively, with $c(0)$ distributed equally between them. Then
we obtain

$$h_{\min}(n) = \mathcal{F}^{-1}\{\exp \mathcal{F}[c_+(k)]\} \tag{3.1.16}$$

as the impulse response of the minimum-phase system $H_{\min}(z)$. Similarly,

$$h_{\max}(n) = \mathcal{F}^{-1}\{\exp \mathcal{F}[c_-(k)]\} \tag{3.1.17}$$

is the corresponding maximum-phase system. This completes the spectral factorization procedure for an arbitrary
PSD $R_x(e^{j\omega})$, which, in general, is a complicated task. However, it is straightforward if $R_x(z)$ is a rational
function.
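The cepstral procedure of (3.1.13) to (3.1.16) can be approximated with the FFT. The sketch below is illustrative only: the test PSD (that of $H(z) = 1 + 0.5z^{-1}$ with $\sigma_w^2 = 1$) and the FFT size `N` are assumptions, and sampling the PSD on a finite grid introduces cepstral aliasing that is negligible here because $c(k)$ decays quickly.

```python
import numpy as np

N = 512                                   # FFT size (assumed large enough to avoid cepstral aliasing)
omega = 2 * np.pi * np.arange(N) / N
H_true = 1 + 0.5 * np.exp(-1j * omega)    # hypothetical minimum-phase H(z) = 1 + 0.5 z^-1
R = np.abs(H_true) ** 2                   # PSD samples, sigma_w^2 = 1

c = np.fft.ifft(np.log(R)).real           # cepstrum c(k), cf. (3.1.13)

c_plus = np.zeros(N)
c_plus[0] = c[0] / 2                      # half of c(0) goes to the causal part
c_plus[1:N // 2] = c[1:N // 2]            # positive-time part c(k) u(k-1)

# Minimum-phase impulse response, cf. (3.1.16)
h_min = np.fft.ifft(np.exp(np.fft.fft(c_plus))).real
print(h_min[:4])                          # close to the coefficients of 1 + 0.5 z^-1
```

In this construction the gain $\sigma_w$ is absorbed into `h_min` through the $c(0)/2$ term; with $\sigma_w^2 = 1$ that term is zero.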


Spectral flatness measure 


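The spectral flatness measure (SFM) of a zero-mean process with PSD $R_x(e^{j\omega})$ is defined by (Makhoul 1975)

$$\mathrm{SFM}_x \triangleq \frac{\exp\left\{\frac{1}{2\pi} \int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\, d\omega\right\}}{\frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\omega})\, d\omega} = \frac{\sigma_w^2}{\sigma_x^2} \tag{3.1.18}$$

where the second equality follows from (3.1.12). It describes the shape (or more appropriately, flatness) of the PSD
by a single number. If $x(n)$ is a white noise process, then $R_x(e^{j\omega}) = \sigma_x^2$ and $\mathrm{SFM}_x = 1$. More specifically, we
can show that

$$0 \le \mathrm{SFM}_x \le 1 \tag{3.1.19}$$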


Observe that the numerator of (3.1.18) is the geometric mean while the denominator is the arithmetic mean of a
real-valued, nonnegative continuous waveform $R_x(e^{j\omega})$. Since $x(n)$ is a regular process satisfying (3.1.10), these
means are always positive. Furthermore, their ratio, by definition, is never greater than unity and is equal to unity if
the waveform is constant. This, then, proves (3.1.19). A detailed proof is given in Jayant and Noll (1984).
When $x(n)$ is obtained by filtering the zero-mean white noise process $w(n)$ through the filter $H(z)$, then
the coloring of $R_x(e^{j\omega})$ is due to $H(z)$. In this case, $R_x(e^{j\omega}) = \sigma_w^2 |H(e^{j\omega})|^2$ from (3.1.9), and we obtain

$$\mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = \frac{\sigma_w^2}{\frac{1}{2\pi} \int_{-\pi}^{\pi} \sigma_w^2 |H(e^{j\omega})|^2\, d\omega} = \frac{1}{\frac{1}{2\pi} \int_{-\pi}^{\pi} |H(e^{j\omega})|^2\, d\omega} \tag{3.1.20}$$

Thus $\mathrm{SFM}_x$ is the inverse of the filter power (or power transfer factor) if $h(0)$ is normalized to unity.
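As a numerical illustration of (3.1.18) and (3.1.20), the sketch below evaluates the geometric and arithmetic means of a PSD on a frequency grid. The single-pole filter $H(z) = 1/(1 - a z^{-1})$ with $a = 0.8$ and the grid size are assumptions made for the example; for this filter $h(n) = a^n u(n)$, so the filter power is $1/(1 - a^2)$.

```python
import cmath
import math

a = 0.8          # hypothetical pole of H(z) = 1 / (1 - a z^-1), with h(0) = 1
sigma_w2 = 1.0   # white noise variance
N = 4096         # number of frequency samples on [-pi, pi)

psd = []
for i in range(N):
    omega = -math.pi + 2 * math.pi * i / N
    H = 1 / (1 - a * cmath.exp(-1j * omega))
    psd.append(sigma_w2 * abs(H) ** 2)

geometric_mean = math.exp(sum(math.log(v) for v in psd) / N)   # numerator of (3.1.18)
arithmetic_mean = sum(psd) / N                                 # denominator of (3.1.18)
sfm = geometric_mean / arithmetic_mean

filter_power = 1 / (1 - a * a)   # sum of h(n)^2 for h(n) = a^n u(n)
print(sfm, 1 / filter_power)     # the two values agree, as (3.1.20) predicts
```

Moving the pole toward the unit circle makes the PSD more peaked and drives the SFM toward zero.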


3.1.2 Parametric Pole-Zero Signal Models 


Parametric models describe a system with a finite number of parameters. The major subject of this chapter is the 
treatment of parametric models that have rational system functions. To this end, consider a system described by the 
following linear constant-coefficient difference equation 


$$x(n) + \sum_{k=1}^{P} a_k\,x(n-k) = \sum_{k=0}^{Q} d_k\,w(n-k) \tag{3.1.21}$$


where $w(n)$ and $x(n)$ are the input and output signals, respectively. Taking the $z$-transform of both sides, we
find that the system function is

$$H(z) = \frac{X(z)}{W(z)} = \frac{\sum_{k=0}^{Q} d_k z^{-k}}{1 + \sum_{k=1}^{P} a_k z^{-k}} \triangleq \frac{D(z)}{A(z)} \tag{3.1.22}$$

We can express $H(z)$ in terms of the poles and zeros of the system as follows:

$$H(z) = d_0\, \frac{\prod_{k=1}^{Q} (1 - z_k z^{-1})}{\prod_{k=1}^{P} (1 - p_k z^{-1})} \tag{3.1.23}$$

The system has $Q$ zeros $\{z_k\}$ and $P$ poles $\{p_k\}$ (zeros and poles at $z = 0$ are not considered here). The term $d_0$
is the system gain. For the rest of the book, we assume that the polynomials $D(z)$ and $A(z)$ do not have any
common roots, that is, common poles and zeros have already been canceled.


Types of pole-zero models 


There are three cases of interest:
• For $P > 0$ and $Q > 0$, we have a pole-zero model, denoted by PZ($P$, $Q$). If the model is assumed to be causal,
its output is given by

$$x(n) = -\sum_{k=1}^{P} a_k\,x(n-k) + \sum_{k=0}^{Q} d_k\,w(n-k) \tag{3.1.24}$$

• For $P = 0$, we have an all-zero model, denoted by AZ($Q$). The input-output difference equation is

$$x(n) = \sum_{k=0}^{Q} d_k\,w(n-k) \tag{3.1.25}$$

• For $Q = 0$, we have an all-pole model, denoted by AP($P$). The input-output difference equation is

$$x(n) = -\sum_{k=1}^{P} a_k\,x(n-k) + d_0\,w(n) \tag{3.1.26}$$

If we excite a parametric model with white noise, we obtain a signal whose second-order moments are
determined by the parameters of the model. Indeed, from Section 2.2.2, we recall that if $w(n) \sim \mathrm{IID}\{0, \sigma_w^2\}$ with
finite variance, then

$$r_x(l) = \sigma_w^2\, r_h(l) = \sigma_w^2\, h(l) * h^*(-l) \tag{3.1.27}$$

$$R_x(z) = \sigma_w^2\, R_h(z) = \sigma_w^2\, H(z)\, H^*\!\left(\frac{1}{z^*}\right) \tag{3.1.28}$$

$$R_x(e^{j\omega}) = \sigma_w^2\, R_h(e^{j\omega}) = \sigma_w^2\, |H(e^{j\omega})|^2 \tag{3.1.29}$$

Such signal models are of great practical interest and have special names in the statistical literature:
• The AZ($Q$) is known as the moving-average model, denoted by MA($Q$).
• The AP($P$) is known as the autoregressive model, denoted by AR($P$).
• The PZ($P$, $Q$) is known as the autoregressive moving-average model, denoted by ARMA($P$, $Q$).
We specify a parametric signal model by normalizing $d_0 = 1$ and setting the variance of the input to $\sigma_w^2$. The
defining set of model parameters is given by $\{a_1, a_2, \ldots, a_P, d_1, \ldots, d_Q, \sigma_w^2\}$ (see Figure 3.3). An alternative is
to set $\sigma_w^2 = 1$ and leave $d_0$ arbitrary. We stress that these models assume the resulting processes are stationary,
which is ensured if the corresponding systems are BIBO stable.

FIGURE 3.3
Block diagram representation of a parametric, rational signal model.
[Block diagram: input $w(n)$ → $H(z) = D(z)/A(z)$ → output $x(n)$, with parameter set $\{\sigma_w^2, d_1, \ldots, d_Q, a_1, \ldots, a_P\}$.]


Short-memory behavior


To find the memory behavior of pole-zero models, we investigate the nature of their impulse response. To this
end, we recall that for $Q \ge P$, (3.1.23) can be expanded as

$$H(z) = \sum_{j=0}^{Q-P} B_j z^{-j} + \sum_{k=1}^{P} \frac{A_k}{1 - p_k z^{-1}} \tag{3.1.30}$$

where for simplicity we assume that the model has $P$ distinct poles. The first term in (3.1.30) disappears if $P > Q$.
The coefficients $B_j$ can be obtained by long division, and the coefficients $A_k$ are given by

$$A_k = (1 - p_k z^{-1})\, H(z) \big|_{z = p_k} \tag{3.1.31}$$

If the model is causal, taking the inverse $z$-transform results in an impulse response that is a linear combination of
impulses, real exponentials, and damped sinusoids (produced by the combination of complex exponentials)

$$h(n) = \sum_{j=0}^{Q-P} B_j\,\delta(n-j) + \sum_{k=1}^{P_1} A_k (p_k)^n u(n) + \sum_{i=1}^{P_2} C_i\, r_i^n \cos(\omega_i n + \phi_i)\, u(n) \tag{3.1.32}$$

where $p_i = r_i e^{\pm j\omega_i}$ and $P = P_1 + 2P_2$, with $P_1$ the number of real poles and $P_2$ the number of complex conjugate
pole pairs. Recall that $u(n)$ and $\delta(n)$ are the unit step and unit impulse functions, respectively. We note that the
memory of any all-pole model decays exponentially with time and that the rate of decay is controlled by the pole
closest to the unit circle. The contribution of multiple poles at the same location is treated in Problem 3.1.

Careful inspection of (3.1.32) leads to the following conclusions: 

1. For AZ( Q ) models, the impulse response has finite duration and, therefore, can have any shape. 

2. The impulse response of causal AP( P ) and PZ( P , Q) models with single poles consists of a linear combination 
of damped real exponentials (produced by the real poles) and exponentially damped sinusoids (produced by 
complex conjugate poles). The rate of decay decreases as the poles move closer to the unit circle and is determined 
by the pole closest to the unit circle. 

3. The model is stable if and only if $h(n)$ is absolutely summable, which, due to (3.1.32), is equivalent to $|p_k| < 1$
for all $k$. In other words, a causal pole-zero model is BIBO stable if and only if all the poles are inside the unit
circle.

We conclude that causal, stable PZ($P$, $Q$) models with $P > 0$ have an exponentially fading memory because
their impulse response decays exponentially with time. Therefore, the autocorrelation $r_h(l) = h(l) * h^*(-l)$ also
decays exponentially (see Example 3.2.2), and pole-zero models have short memory according to the definition given
in Section 2.2.3.
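The partial fraction expansion (3.1.30) and the residue formula (3.1.31) can be checked against the impulse response computed directly from the difference equation (3.1.24). The sketch below assumes a hypothetical PZ(3, 1) model with one real pole and one complex conjugate pair; since $Q < P$, there are no $B_j$ terms.

```python
import cmath

# Hypothetical PZ(3,1) model: poles and numerator chosen only for illustration.
poles = [0.9, 0.5 * cmath.exp(1j * cmath.pi / 3), 0.5 * cmath.exp(-1j * cmath.pi / 3)]
d = [1.0, 0.5]                       # D(z) = 1 + 0.5 z^-1, so Q < P and no B_j terms

def residue(k):
    """A_k = (1 - p_k z^-1) H(z) evaluated at z = p_k, as in (3.1.31)."""
    pk = poles[k]
    num = sum(dm * pk ** (-m) for m, dm in enumerate(d))
    den = 1.0
    for j, pj in enumerate(poles):
        if j != k:
            den *= 1 - pj / pk
    return num / den

A = [residue(k) for k in range(len(poles))]

# Denominator coefficients a_k of A(z) = prod_k (1 - p_k z^-1)
a = [1.0]
for p in poles:
    a = [x - p * y for x, y in zip(a + [0.0], [0.0] + a)]

# Impulse response from the difference equation (3.1.24) with w(n) = delta(n)
L = 40
h = []
for n in range(L):
    acc = d[n] if n < len(d) else 0.0
    acc -= sum(a[k] * h[n - k] for k in range(1, len(a)) if n - k >= 0)
    h.append(acc)

# Impulse response from the partial fraction expansion: sum_k A_k p_k^n
h_pf = [sum(Ak * pk ** n for Ak, pk in zip(A, poles)) for n in range(L)]
max_err = max(abs(u - v) for u, v in zip(h, h_pf))
print(max_err, abs(h[-1]))           # the tail is dominated by the pole at 0.9
```

The residues of the complex conjugate pair are themselves conjugates, so their two terms combine into the damped sinusoid of (3.1.32).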


Generation of random signals with rational power spectra 


Sample realizations of random sequences with rational power spectra can be easily generated by using the 
difference equation (3.1.24) and a random number generator. In most applications, we use a Gaussian excitation 
because the generated sequence will also be Gaussian. For non-Gaussian inputs, it is difficult to predict the type of 
distribution of the output signal. If, on one hand, we specify the frequency response of the model, the coefficients of 
the difference equation can be obtained by using a digital filter design package. If, on the other hand, the power 
spectrum or the autocorrelation is given, the coefficients of the model are determined via spectral factorization. If we 
wish to avoid the transient effects that make some of the initial output samples nonstationary, we should consider the 
response of the model only after the initial transients have died out. 
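A minimal sketch of this generation procedure follows. The function `generate_pz`, its parameters, and the AP(1) test model are assumptions made for illustration; the filter starts from zero initial conditions, so the first `n_discard` output samples are dropped to suppress the start-up transient.

```python
import random

def generate_pz(a, d, sigma_w, n_samples, n_discard, seed=0):
    """Generate x(n) from (3.1.24) with Gaussian w(n); a[0] = 1 is assumed."""
    rng = random.Random(seed)
    total = n_samples + n_discard
    w = [rng.gauss(0.0, sigma_w) for _ in range(total)]
    x = [0.0] * total
    for n in range(total):
        acc = sum(d[k] * w[n - k] for k in range(len(d)) if n - k >= 0)
        acc -= sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0)
        x[n] = acc
    return x[n_discard:]          # drop the start-up transient

# Hypothetical AP(1) model with a pole at z = 0.8
x = generate_pz(a=[1.0, -0.8], d=[1.0], sigma_w=1.0, n_samples=100_000, n_discard=200)
mean_hat = sum(x) / len(x)
var_hat = sum((v - mean_hat) ** 2 for v in x) / len(x)
print(var_hat, 1 / (1 - 0.8 ** 2))   # sample variance vs. theoretical sigma_w^2 / (1 - a^2)
```

How many samples to discard depends on the pole closest to the unit circle; 200 samples is ample here because $0.8^{200}$ is negligible.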


3.1.3 Mixed Processes and Wold Decomposition 


An arbitrary stationary random process can be constructed to possess a continuous PSD $R_x(e^{j\omega})$ and a discrete
power spectrum $R_x(k)$. Such processes are called mixed processes because the continuous PSD is due to regular
processes while the discrete spectrum is due to harmonic (or almost periodic) processes. A further interpretation of
mixed processes is that the first part is an unpredictable process while the second part is a predictable process (in the
sense that past samples can be used to exactly determine future samples). This interpretation is due to the Wold
decomposition theorem.

¹ Poles on the unit circle are discussed in Section 3.5.


THEOREM 3.1 (WOLD DECOMPOSITION). A general stationary random process can be written as a sum

$$x(n) = x_r(n) + x_p(n) \tag{3.1.33}$$

where $x_r(n)$ is a regular process possessing a continuous spectrum and $x_p(n)$ is a predictable process possessing a discrete
spectrum. Furthermore, $x_r(n)$ is orthogonal to $x_p(n)$; that is,

$$E\{x_r(n_1)\, x_p^*(n_2)\} = 0 \qquad \text{for all } n_1, n_2 \tag{3.1.34}$$

The proof of this theorem is very involved, but a good approach to it is given in Therrien (1992). Using (3.1.34),
the correlation sequence of $x(n)$ in (3.1.33) is given by

$$r_x(l) = r_{x_r}(l) + r_{x_p}(l)$$

from which we obtain the continuous and discrete spectra. As discussed above, the regular process has an innovations
representation $w(n)$ that is uncorrelated but not necessarily independent. For example, $w(n)$ can be the output of an
all-pass filter driven by an IID sequence.


3.2 All-Pole Models 


We start our discussion of linear signal models with all-pole models because they are the easiest to analyze and the 
most often used in practical applications. We assume an all-pole model of the form 


$$H(z) = \frac{d_0}{A(z)} = \frac{d_0}{1 + \sum_{k=1}^{P} a_k z^{-k}} = \frac{d_0}{\prod_{k=1}^{P} (1 - p_k z^{-1})} \tag{3.2.1}$$

where $d_0$ is the system gain and $P$ is the order of the model. The all-pole model can be implemented using either
a direct or a lattice structure. The conversion between the two sets of parameters can be done by using the step-up and
step-down recursions.


3.2.1 Model Properties


In this section, we derive analytic expressions for various properties of the all-pole model, namely, the impulse
response, the autocorrelation, and the spectrum. We determine the system-related properties $r_h(l)$ and $R_h(e^{j\omega})$
because the results can be readily applied to obtain the signal model properties for inputs with both continuous and
discrete spectra.

Impulse response. The impulse response $h(n)$ can be specified by first rewriting (3.2.1) as

$$H(z) + \sum_{k=1}^{P} a_k H(z) z^{-k} = d_0$$

and then taking the inverse $z$-transform to obtain

$$h(n) + \sum_{k=1}^{P} a_k\,h(n-k) = d_0\,\delta(n) \tag{3.2.2}$$

If the system is causal, then

$$h(n) = -\sum_{k=1}^{P} a_k\,h(n-k) + d_0\,\delta(n) \tag{3.2.3}$$

If $H(z)$ has all its poles inside the unit circle, then $h(n)$ is a causal, stable sequence and the system is
minimum-phase. From (3.2.3) we have

$$h(0) = d_0 \tag{3.2.4}$$

$$h(n) = -\sum_{k=1}^{P} a_k\,h(n-k) \qquad n > 0 \tag{3.2.5}$$

and owing to causality we have

$$h(n) = 0 \qquad n < 0 \tag{3.2.6}$$

Thus, except for the value at $n = 0$, $h(n)$ can be obtained recursively as a linearly weighted summation of its
previous values $h(n-1), \ldots, h(n-P)$. One can say that $h(n)$ can be predicted (with zero error for $n \ne 0$) from
the past $P$ values. Thus, the coefficients $\{a_k\}$ are often referred to as predictor coefficients. Note that there is a
close relationship between all-pole models and linear prediction that will be discussed in Section 3.2.2.
From (3.2.4) and (3.2.5), we can also write the inverse relation

$$a_n = -\frac{h(n)}{h(0)} - \sum_{k=1}^{n-1} a_k\, \frac{h(n-k)}{h(0)} \qquad n > 0 \tag{3.2.7}$$

with $a_0 = 1$. From (3.2.7) and (3.2.4), we conclude that if we are given the first $P+1$ values of the impulse
response $h(n)$, $0 \le n \le P$, then the parameters of the all-pole filter are completely specified.
Finally, we note that a causal $H(z)$ can be written as a one-sided, infinite polynomial
$H(z) = \sum_{n=0}^{\infty} h(n) z^{-n}$. This representation of $H(z)$ implies that any finite-order, all-pole model
can be represented equivalently by an infinite number of zeros. In general, a single pole can be represented by an
infinite number of zeros, and conversely a single zero can be represented by an infinite number of poles. If the poles
are inside the unit circle, so are the corresponding zeros, and vice versa.
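The recursion (3.2.4) to (3.2.6) and the inverse relation (3.2.7) translate directly into code. The sketch below uses hypothetical AP(2) parameters; the function names are illustrative. The inverse relation is written as a single sum starting at $k = 0$ with $a_0 = 1$, which is equivalent to (3.2.7).

```python
def allpole_impulse_response(a, d0, n_samples):
    """h(n) from (3.2.4)-(3.2.6); a = [a_1, ..., a_P]."""
    h = []
    for n in range(n_samples):
        acc = d0 if n == 0 else 0.0
        acc -= sum(a[k - 1] * h[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        h.append(acc)
    return h

def coefficients_from_impulse_response(h, P):
    """Recover d_0 and a_1..a_P from h(0)..h(P) using (3.2.4) and (3.2.7)."""
    a = [1.0]                                 # a_0 = 1
    for n in range(1, P + 1):
        # a_n = -sum_{k=0}^{n-1} a_k h(n-k) / h(0)
        a.append(-sum(a[k] * h[n - k] for k in range(n)) / h[0])
    return h[0], a[1:]

a_true, d0_true = [-0.9, 0.2], 2.0    # hypothetical AP(2) parameters
h = allpole_impulse_response(a_true, d0_true, 3)
d0, a = coefficients_from_impulse_response(h, P=2)
print(d0, a)                          # recovers the gain and both coefficients
```

Only the first $P + 1 = 3$ impulse response samples are needed, in line with the statement above.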


EXAMPLE 3.2.1. A single pole at $z = a$ can be represented by

$$H(z) = \frac{1}{1 - a z^{-1}} = \sum_{n=0}^{\infty} a^n z^{-n} \qquad |a| < 1 \tag{3.2.8}$$

The question is, where are the infinite number of zeros located? To find the answer, let us consider the finite polynomial

$$H_N(z) = \sum_{n=0}^{N} a^n z^{-n} \tag{3.2.9}$$

where we have truncated $H(z)$ at $n = N$. Thus $H_N(z)$ is a geometric series that can be written in closed form as

$$H_N(z) = \frac{1 - a^{N+1} z^{-(N+1)}}{1 - a z^{-1}} \tag{3.2.10}$$

And $H_N(z)$ has a single pole at $z = a$ and $N+1$ zeros at

$$z_i = a\,e^{j2\pi i/(N+1)} \qquad i = 0, 1, \ldots, N \tag{3.2.11}$$

The $N+1$ zeros are equally distributed on the circle $|z| = a$ with one of the zeros (for $i = 0$) located at $z = a$. But the zero at
$z = a$ cancels the pole at the same location. Therefore, $H_N(z)$ has the remaining $N$ zeros:

$$z_i = a\,e^{j2\pi i/(N+1)} \qquad i = 1, 2, \ldots, N \tag{3.2.12}$$

The transfer function $H(z)$ of the single-pole model is obtained from $H_N(z)$ by letting $N$ go to infinity. In the limit, $H_\infty(z)$
has an infinite number of zeros equally distributed on the circle $|z| = a$; the zeros are everywhere on that circle except at the point
$z = a$. Similarly, the denominator from (3.2.8), a polynomial with a single zero at $z = a$, can be written as

$$A(z) = 1 - a z^{-1} = \frac{1}{\sum_{n=0}^{\infty} a^n z^{-n}} \qquad |a| < 1 \tag{3.2.13}$$

that is, a single zero can also be represented by an infinite number of poles. In this case, the poles are equally distributed on a circle
that passes through the location of the zero; the poles are everywhere on the circle except at the actual location of the zero.
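The zero locations in (3.2.12) can be confirmed numerically by rooting the truncated polynomial. The values $a = 0.8$ and $N = 12$ below are arbitrary choices for illustration.

```python
import numpy as np

a, N = 0.8, 12                        # hypothetical pole location and truncation order
# H_N(z) = sum_{n=0}^{N} a^n z^-n has the same nonzero roots as
# z^N + a z^(N-1) + ... + a^N, whose coefficients are listed below.
zeros = np.roots([a ** n for n in range(N + 1)])

radii = np.abs(zeros)
angles = np.sort(np.mod(np.angle(zeros), 2 * np.pi))
expected = 2 * np.pi * np.arange(1, N + 1) / (N + 1)   # angles of (3.2.12)

print(radii)                          # all close to a: the zeros lie on |z| = a
print(np.allclose(angles, expected))  # equally spaced, with the point z = a missing
```

Increasing `N` fills the circle more densely, while the gap at angle zero (the canceled pole location) remains.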


Autocorrelation. The impulse response h(n) of an all-pole model has infinite duration so that its 
autocorrelation involves an infinite summation, which is not practical to write in closed form except for low-order 
models. However, the autocorrelation function obeys a recursive relation that relates the autocorrelation values to the 
model parameters. Multiplying (3.2.2) by $h^*(n-l)$ and summing over all $n$, we have

$$\sum_{n=-\infty}^{\infty} \sum_{k=0}^{P} a_k\,h(n-k)\,h^*(n-l) = d_0 \sum_{n=-\infty}^{\infty} h^*(n-l)\,\delta(n) \tag{3.2.14}$$

where $a_0 = 1$. Interchanging the order of summations in the left-hand side, we obtain

$$\sum_{k=0}^{P} a_k\,r_h(l-k) = d_0\,h^*(-l) \qquad -\infty < l < \infty \tag{3.2.15}$$

where $r_h(l)$ is the autocorrelation of $h(n)$. Equation (3.2.15) is true for all $l$, but because $h(l) = 0$ for $l < 0$,
$h^*(-l) = 0$ for $l > 0$, and we have

$$\sum_{k=0}^{P} a_k\,r_h(l-k) = 0 \qquad l > 0 \tag{3.2.16}$$

From (3.2.4) and (3.2.15), we also have for $l = 0$,

$$\sum_{k=0}^{P} a_k\,r_h^*(k) = |d_0|^2 \tag{3.2.17}$$

where we used the fact that $r_h^*(-l) = r_h(l)$. Equation (3.2.16) can be rewritten as

$$r_h(l) = -\sum_{k=1}^{P} a_k\,r_h(l-k) \qquad l > 0 \tag{3.2.18}$$

which is a recursive relation for $r_h(l)$ in terms of past values of the autocorrelation and $\{a_k\}$. Relation (3.2.18) for
$r_h(l)$ is similar to relation (3.2.5) for $h(n)$, but with one important difference: (3.2.5) for $h(n)$ is true for all
$n \ne 0$ while (3.2.18) for $r_h(l)$ is true only if $l > 0$; for $l < 0$, $r_h(l)$ obeys (3.2.15).

If we define the normalized autocorrelation coefficients as

$$\rho_h(l) = \frac{r_h(l)}{r_h(0)} \tag{3.2.19}$$

then we can divide (3.2.17) by $r_h(0)$ and deduce the following relation for $r_h(0)$

$$r_h(0) = \frac{|d_0|^2}{1 + \sum_{k=1}^{P} a_k\,\rho_h^*(k)} \tag{3.2.20}$$

which is the energy of the output of the all-pole filter when excited by a single impulse.
Autocorrelation in terms of poles. The complex spectrum of the AP($P$) model is

$$R_h(z) = H(z)\,H^*\!\left(\frac{1}{z^*}\right) = |d_0|^2 \prod_{k=1}^{P} \frac{1}{(1 - p_k z^{-1})(1 - p_k^* z)} \tag{3.2.21}$$

Therefore, the autocorrelation sequence can be expressed in terms of the poles by taking the inverse $z$-transform of
$R_h(z)$, that is, $r_h(l) = \mathcal{Z}^{-1}\{R_h(z)\}$. The poles $p_k$ of the minimum-phase model $H(z)$ contribute causal terms in
the partial fraction expansion, whereas the poles $1/p_k^*$ of the nonminimum-phase model $H^*(1/z^*)$ contribute
noncausal terms. This is best illustrated with the following example.


EXAMPLE 3.2.2. Consider the following minimum-phase AP(1) model

$$H(z) = \frac{1}{1 + a z^{-1}} \qquad -1 < a < 1 \tag{3.2.22}$$

Owing to causality, the ROC of $H(z)$ is $|z| > |a|$. The $z$-transform

$$H(z^{-1}) = \frac{1}{1 + a z} \qquad -1 < a < 1 \tag{3.2.23}$$

corresponds to the noncausal sequence $h(-n) = (-a)^{-n} u(-n)$, and its ROC is $|z| < 1/|a|$. Hence,

$$R_h(z) = H(z)\,H(z^{-1}) = \frac{1}{(1 + a z^{-1})(1 + a z)} \tag{3.2.24}$$

which corresponds to a two-sided sequence because its ROC, $|a| < |z| < 1/|a|$, is a ring in the $z$-plane. Using partial fraction
expansion, we obtain

$$R_h(z) = \frac{1}{1 - a^2}\, \frac{-a z^{-1}}{1 + a z^{-1}} + \frac{1}{1 - a^2}\, \frac{1}{1 + a z} \tag{3.2.25}$$

The pole $p = -a$ corresponds to the causal sequence $[1/(1-a^2)](-a)^{l} u(l-1)$, and the pole $p = -1/a$ to the noncausal
sequence $[1/(1-a^2)](-a)^{-l} u(-l)$. Combining the two terms, we obtain

$$r_h(l) = \frac{(-a)^{|l|}}{1 - a^2} \qquad -\infty < l < \infty \tag{3.2.26}$$

or

$$\rho_h(l) = (-a)^{|l|} \qquad -\infty < l < \infty \tag{3.2.27}$$

Note that complex conjugate poles will contribute two-sided damped sinusoidal terms like the ones described in
Section 3.1.2 for the AP(2) model.
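The closed form (3.2.26) can be compared with a direct evaluation of $r_h(l) = \sum_n h(n) h(n-l)$ for the real impulse response $h(n) = (-a)^n u(n)$. The value $a = 0.6$ and the truncation length are assumptions for this sketch.

```python
a = 0.6                     # hypothetical coefficient in H(z) = 1 / (1 + a z^-1)
n_terms = 200               # truncation length; (-a)^n is negligible beyond this

h = [(-a) ** n for n in range(n_terms)]      # h(n) = (-a)^n u(n)

def r_h(l):
    """Deterministic autocorrelation r_h(l) = sum_n h(n) h(n - l) for real h."""
    l = abs(l)              # r_h(-l) = r_h(l) for a real sequence
    return sum(h[n] * h[n - l] for n in range(l, n_terms))

for l in range(-3, 4):
    closed_form = (-a) ** abs(l) / (1 - a * a)      # (3.2.26)
    print(l, r_h(l), closed_form)
```

The sign of $r_h(l)$ alternates with the lag because the pole sits at $z = -a$ on the negative real axis.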

Impulse train excitations. The response $\tilde{h}(n)$ of an AP($P$) model to a periodic impulse train with period $L$ is
periodic with the same period and is given by

$$\tilde{h}(n) + \sum_{k=1}^{P} a_k\,\tilde{h}(n-k) = d_0 \sum_{m=-\infty}^{\infty} \delta(n + Lm) = \begin{cases} d_0 & n + Lm = 0 \\ 0 & n + Lm \ne 0 \end{cases} \tag{3.2.28}$$

which shows that the prediction error is zero for samples inside the period and $d_0$ at the beginning of each period. If
we multiply both sides of (3.2.28) by $\tilde{h}^*(n-l)$ and sum over a period $0 \le n \le L-1$, we obtain

$$\tilde{r}_h(l) + \sum_{k=1}^{P} a_k\,\tilde{r}_h(l-k) = \frac{d_0}{L}\,\tilde{h}^*(-l) \qquad \text{all } l \tag{3.2.29}$$

where $\tilde{r}_h(l)$ is the periodic autocorrelation of $\tilde{h}(n)$. Since, in contrast to $h(n)$ in (3.2.15), $\tilde{h}(n)$ is not
necessarily zero for $n < 0$, the periodic autocorrelation $\tilde{r}_h(l)$ will not in general obey the linear prediction equation
anywhere. Similar results can be obtained for harmonic process excitations.


Model parameters in terms of autocorrelation. Equations (3.2.15) for $l = 0, 1, \ldots, P$ comprise $P+1$
equations that relate the $P+1$ parameters of $H(z)$, namely, $d_0$ and $\{a_k, 1 \le k \le P\}$, to the first $P+1$
autocorrelation coefficients $r_h(0), r_h(1), \ldots, r_h(P)$. These $P+1$ equations can be written in matrix form as

$$\begin{bmatrix} r_h(0) & r_h^*(1) & \cdots & r_h^*(P) \\ r_h(1) & r_h(0) & \cdots & r_h^*(P-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_h(P) & r_h(P-1) & \cdots & r_h(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} |d_0|^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{3.2.30}$$

If we are given the first $P+1$ autocorrelations, (3.2.30) comprises a system of $P+1$ linear equations, with a
Hermitian Toeplitz matrix that can be solved for $d_0$ and $\{a_k\}$.

Because of the special structure in (3.2.30), the model parameters are found from the autocorrelations by using
the last set of $P$ equations in (3.2.30), followed by the computation of $d_0$ from the first equation, which is the
same as (3.2.17). From (3.2.30), we can write in matrix notation

$$R_h \mathbf{a} = -\mathbf{r}_h \tag{3.2.31}$$

where $R_h$ is the autocorrelation matrix, $\mathbf{a}$ is the vector of the model parameters, and $\mathbf{r}_h$ is the vector of
autocorrelations. Since $r_x(l) = \sigma_w^2\, r_h(l)$, we can also express the model parameters in terms of the autocorrelation
$r_x(l)$ of the output process $x(n)$ as follows:

$$R_x \mathbf{a} = -\mathbf{r}_x \tag{3.2.32}$$

These equations are known as the Yule-Walker equations in the statistics literature. In the sequel, we drop the
subscript from the autocorrelation sequence or matrix whenever the analysis holds for both the impulse response and
the model output.

Because of the Toeplitz structure and the nature of the right-hand side, the linear systems (3.2.31) and (3.2.32)
can be solved recursively by using the algorithm of Levinson-Durbin. After $\mathbf{a}$ is solved for, the system gain $d_0$
can be computed from (3.2.17).

Therefore, given $r(0), r(1), \ldots, r(P)$, we can completely specify the parameters of the all-pole model by
solving a set of linear equations. Below, we will see that the converse is also true: Given the model parameters, we
can find the first $P+1$ autocorrelations by solving a set of linear equations. This elegant solution of the spectral
factorization problem is unique to all-pole models. In the case in which the model contains zeros ($Q \ne 0$), the
spectral factorization problem requires the solution of a nonlinear system of equations.
Autocorrelation in terms of model parameters. If we normalize the autocorrelations in (3.2.31) by dividing
throughout by $r(0)$, we obtain the following system of equations

$$\mathbf{P} \mathbf{a} = -\boldsymbol{\rho} \tag{3.2.33}$$

where $\mathbf{P}$ is the normalized autocorrelation matrix and

$$\boldsymbol{\rho} = [\rho(1) \; \rho(2) \; \cdots \; \rho(P)]^T \tag{3.2.34}$$

is the vector of normalized autocorrelations. This set of $P$ equations relates the $P$ model coefficients with the first
$P$ (normalized) autocorrelation values. If the poles of the all-pole filter are strictly inside the unit circle, the mapping
between the $P$-dimensional vectors $\mathbf{a}$ and $\boldsymbol{\rho}$ is unique. If, in fact, we are given the vector $\mathbf{a}$, then the
normalized autocorrelation vector $\boldsymbol{\rho}$ can be computed from $\mathbf{a}$ by using the set of equations that can be deduced
from (3.2.33)

$$\mathbf{A} \boldsymbol{\rho} = -\mathbf{a} \tag{3.2.35}$$

where $\langle \mathbf{A} \rangle_{ij} = a_{i-j} + a_{i+j}$, with $a_0 = 1$ and assuming $a_m = 0$ for $m < 0$ and $m > P$ (see Problem 3.6).
Given the set of coefficients in $\mathbf{a}$, $\boldsymbol{\rho}$ can be obtained by solving (3.2.35). We will see that, under the
assumption of a stable $H(z)$, a solution always exists. Furthermore, there exists a simple, recursive solution that is
efficient (see Section 6.5). If, in addition to $\mathbf{a}$, we are given $d_0$, we can evaluate $r(0)$ with (3.2.20) from $\boldsymbol{\rho}$
computed by (3.2.35). Autocorrelation values $r(l)$ for lags $l > P$ are found by using the recursion in (3.2.18) with
$r(0), r(1), \ldots, r(P)$.

EXAMPLE 3.2.3. For the AP(3) model with real coefficients we have 

$$ \begin{bmatrix} r(0) & r(1) & r(2) \\ r(1) & r(0) & r(1) \\ r(2) & r(1) & r(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = -\begin{bmatrix} r(1) \\ r(2) \\ r(3) \end{bmatrix} \tag{3.2.36} $$

$$ d_0^2 = r(0) + a_1 r(1) + a_2 r(2) + a_3 r(3) \tag{3.2.37} $$

Therefore, given $r(0)$, $r(1)$, $r(2)$, $r(3)$, we can find the parameters of the all-pole model by solving (3.2.36) and then 
substituting into (3.2.37). 

Suppose now that instead we are given the model parameters $d_0$, $a_1$, $a_2$, $a_3$. If we divide both sides of (3.2.36) by $r(0)$ and 
collect the normalized autocorrelations $\rho(1)$, $\rho(2)$, and $\rho(3)$ as unknowns, we obtain the linear system 

$$ \begin{bmatrix} 1+a_2 & a_3 & 0 \\ a_1+a_3 & 1 & 0 \\ a_2 & a_1 & 1 \end{bmatrix} \begin{bmatrix} \rho(1) \\ \rho(2) \\ \rho(3) \end{bmatrix} = -\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \tag{3.2.38} $$

The value of $r(0)$ is obtained from 

$$ r(0) = \frac{d_0^2}{1 + a_1\rho(1) + a_2\rho(2) + a_3\rho(3)} \tag{3.2.39} $$

If $r(0)=2$, $r(1)=1.6$, $r(2)=1.2$, and $r(3)=1$, the Toeplitz matrix in (3.2.36) is positive definite because it has positive 
eigenvalues. Solving the linear system gives $a_1 = -0.9063$, $a_2 = 0.2500$, and $a_3 = -0.1563$. Substituting these values in 
(3.2.37), we obtain $d_0 = 0.8329$. Using the last two relations, (3.2.38) and (3.2.39), we can recover the autocorrelation from the model parameters. 
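
The two directions of Example 3.2.3 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names (`coef`, `a_ext`, and so on) are illustrative and not part of the text. It solves (3.2.36) and (3.2.37) for the model, then builds the matrix $A$ of (3.2.35) and recovers $\rho$ and $r(0)$ from the parameters.

```python
import numpy as np

# Autocorrelation values r(0)..r(3) from Example 3.2.3.
r = np.array([2.0, 1.6, 1.2, 1.0])
P = len(r) - 1

# Forward mapping: Toeplitz system (3.2.36) for a, then (3.2.37) for d0.
R = np.array([[r[abs(i - j)] for j in range(P)] for i in range(P)])
a = np.linalg.solve(R, -r[1:])
d0 = np.sqrt(r[0] + a @ r[1:])

# Inverse mapping: <A>_ij = a_{i-j} + a_{i+j}, with a_0 = 1 and a_m = 0 outside 0..P.
a_ext = np.concatenate(([1.0], a))

def coef(m):
    """Return a_m, treating out-of-range indices as zero."""
    return a_ext[m] if 0 <= m <= P else 0.0

A = np.array([[coef(i - j) + coef(i + j) for j in range(1, P + 1)]
              for i in range(1, P + 1)])
rho = np.linalg.solve(A, -a)          # normalized autocorrelations, (3.2.35)
r0 = d0**2 / (1.0 + a @ rho)          # energy, (3.2.39)

print("a   =", a)
print("d0  =", d0)
print("rho =", rho, " r(0) =", r0)
```

Solving a small dense system with `numpy.linalg.solve` ignores the Toeplitz structure; it is used here only because it is the shortest way to confirm the numbers.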


Correlation matching. All-pole models have the unique distinction that the model parameters are completely 
specified by the first $P+1$ autocorrelation coefficients via a set of linear equations. We can write 

$$ \begin{bmatrix} d_0 \\ a \end{bmatrix} \Leftrightarrow \begin{bmatrix} r(0) \\ \rho \end{bmatrix} \tag{3.2.40} $$

that is, the mapping of the model parameters $\{d_0, a_1, a_2, \ldots, a_P\}$ to the autocorrelation coefficients specified 
by the vector $\{r(0), \rho(1), \ldots, \rho(P)\}$ is reversible and unique. This statement implies that given any valid 
(positive definite) set of autocorrelation values $r(0), r(1), \ldots, r(P)$, we can always find an all-pole model whose first $P+1$ 
autocorrelation coefficients are equal to the given autocorrelations. This correlation matching of all-pole models is 
quite remarkable. This property is not shared by all-zero models and is true for pole-zero models only under certain 
conditions, as we will see in Section 3.4. 


Spectrum. The z-transform of the autocorrelation $r(l)$ of $H(z)$ is given by 

$$ R(z) = H(z)\, H^*\!\left(\frac{1}{z^*}\right) \tag{3.2.41} $$

The spectrum is then equal to 

$$ R(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{|d_0|^2}{|A(e^{j\omega})|^2} \tag{3.2.42} $$

The right-hand side of (3.2.42) suggests a method for computing the spectrum: First compute $A(e^{j\omega})$ by taking the 
Fourier transform of the sequence $\{1, a_1, \ldots, a_P\}$, then take the squared magnitude and divide $|d_0|^2$ by 
the result. The fast Fourier transform (FFT) can be used to this end by appending to the sequence $\{1, a_1, \ldots, a_P\}$ 
as many zeros as needed to compute the desired number of frequency points. 


Partial autocorrelation and lattice structures. We have seen that an AP($P$) model is completely described by 
the first $P+1$ values of its autocorrelation. However, we cannot determine the order of the model by using the 
autocorrelation sequence because it has infinite duration. Suppose that we start fitting models of increasing order $m$, 
using the autocorrelation sequence of an AP($P$) model and the Yule-Walker equations 

$$ \begin{bmatrix} 1 & \rho^*(1) & \cdots & \rho^*(m-1) \\ \rho(1) & 1 & \cdots & \rho^*(m-2) \\ \vdots & \vdots & \ddots & \vdots \\ \rho(m-1) & \rho(m-2) & \cdots & 1 \end{bmatrix} \begin{bmatrix} a_1^{(m)} \\ a_2^{(m)} \\ \vdots \\ a_m^{(m)} \end{bmatrix} = -\begin{bmatrix} \rho(1) \\ \rho(2) \\ \vdots \\ \rho(m) \end{bmatrix} \tag{3.2.43} $$

The last coefficient $a_m^{(m)}$ of each fitted model, viewed as a sequence in $m = 1, 2, \ldots$, is known as the partial 
autocorrelation sequence (PACS). Since $a_m^{(m)} = 0$ for $m > P$, the PACS can be used to determine the order of the 
all-pole model. Recall the relationship between the coefficients of a direct-form filter and those of a lattice filter: 

$$ a_m^{(m)} = k_m \tag{3.2.44} $$

that is, the PACS is identical to the lattice parameters. A statistical definition and interpretation of the PACS are also 
given in Chapter 7. The PACS can be defined for any valid (i.e., positive definite) autocorrelation sequence and can 
be efficiently computed by using the algorithms of Levinson-Durbin and Schur (see Chapter 6). 

Furthermore, it has been shown (Burg 1975) that 

$$ r(0) \prod_{m=1}^{P} \frac{1-|k_m|}{1+|k_m|} \;\le\; R(e^{j\omega}) \;\le\; r(0) \prod_{m=1}^{P} \frac{1+|k_m|}{1-|k_m|} \tag{3.2.45} $$

which indicates that the spectral dynamic range increases if the magnitude of some lattice parameter moves close to 1 or, 
equivalently, some pole moves close to the unit circle. 
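
The order-detection property of the PACS and the bound (3.2.45) can be illustrated with the model of Example 3.2.3. The sketch below assumes real-valued data and NumPy; `levinson_durbin` is a hypothetical helper written for this illustration (the algorithm itself is developed in Chapter 6). It extends the autocorrelation with the recursion (3.2.18), fits models up to order 6, and evaluates the spectrum with a zero-padded FFT as described above.

```python
import numpy as np

def levinson_durbin(r, order):
    """Lattice parameters k_1..k_order, predictor [1, a_1..a_order], and
    final error energy for a real autocorrelation sequence r(0), r(1), ..."""
    a = np.array([1.0])
    err = float(r[0])
    k = np.zeros(order)
    for m in range(1, order + 1):
        # r(m) + sum_{i=1}^{m-1} a_i r(m-i)
        acc = r[m] + np.dot(a[1:], r[m - 1:0:-1])
        k[m - 1] = -acc / err
        a_prev = np.concatenate((a, [0.0]))
        a = a_prev + k[m - 1] * a_prev[::-1]   # step-up recursion
        err *= 1.0 - k[m - 1] ** 2
    return k, a, err

# AP(3) model of Example 3.2.3.
r = [2.0, 1.6, 1.2, 1.0]
_, a, d0_sq = levinson_durbin(np.array(r), 3)

# Extend r(l) to lags 4..6 with r(l) = -sum_k a_k r(l-k), then refit.
for l in range(4, 7):
    r.append(-sum(a[i] * r[l - i] for i in range(1, 4)))
r = np.array(r)
k, _, _ = levinson_durbin(r, 6)
print("PACS:", np.round(k, 4))          # entries beyond m = 3 vanish

# Spectrum (3.2.42) on 1024 frequency points via zero-padded FFT.
nfft = 1024
spectrum = d0_sq / np.abs(np.fft.fft(a, nfft)) ** 2

# Bounds (3.2.45) from the first P = 3 lattice parameters.
kk = np.abs(k[:3])
lower = r[0] * np.prod((1 - kk) / (1 + kk))
upper = r[0] * np.prod((1 + kk) / (1 - kk))
print(lower, spectrum.min(), spectrum.max(), upper)
```

For this particular model the lower bound is attained at $\omega = \pi$, so a comparison against it should allow for floating-point rounding.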


Equivalent model representations. From the previous discussions we conclude that a minimum-phase AP($P$) 
model can be uniquely described by any one of the following representations: 
1. Direct structure: $\{d_0, a_1, a_2, \ldots, a_P\}$ 
2. Lattice structure: $\{d_0, k_1, k_2, \ldots, k_P\}$ 
3. Autocorrelation: $\{r(0), r(1), \ldots, r(P)\}$ 
where we assume, without loss of generality, that $d_0 > 0$. Note that the minimum-phase property requires that all 
poles be inside the unit circle, or that all $|k_m| < 1$, or that the autocorrelation matrix be positive definite. The 
transformation from any of the above representations to any other can be done by using the algorithms developed in Section 6.5. 


Minimum-phase conditions. As we will show in Section 6.5, if the Toeplitz matrix $R_h$ (or equivalently $R_x$) 
is positive definite, then $|k_m| < 1$ for all $m = 1, 2, \ldots, P$. Therefore, the AP($P$) model obtained by solving the 
Yule-Walker equations is minimum-phase, and the Yule-Walker equations provide a simple and elegant 
solution to the spectral factorization problem for all-pole models. 


EXAMPLE 3.2.4. The poles of the model obtained in Example 3.2.3 are $0.8316$, $0.0373+0.4319j$, and $0.0373-0.4319j$. We see 
that the poles are inside the unit circle and that the autocorrelation sequence is positive definite. If we set $r_h(2) = -1.2$, the 
autocorrelation matrix is no longer positive definite and the obtained model $a = [1\ \ -1.222\ \ 1.1575]^T$, $d_0 = 2.2271$, is 
nonminimum-phase. 


Pole locations. The poles of $H(z)$ are the zeros $\{p_k\}$ of the polynomial $A(z)$. If the coefficients of $A(z)$ 
are assumed to be real, the poles are either real or come in complex conjugate pairs. In order for $H(z)$ to be 
minimum-phase, all poles must be inside the unit circle, that is, $|p_k| < 1$. The model parameters $a_k$ can be written 
as sums of products of the poles $p_k$. In particular, it is easy to see that 

$$ a_1 = -\sum_{k=1}^{P} p_k \tag{3.2.46} $$

$$ a_P = \prod_{k=1}^{P} (-p_k) \tag{3.2.47} $$

Thus, the first coefficient $a_1$ is the negative of the sum of the poles, and the last coefficient $a_P$ is the product of the 
negatives of the individual poles. Since $|p_k| < 1$, we must have $|a_P| < 1$ for a minimum-phase polynomial for which 
$a_0 = 1$. However, note that the reverse is not necessarily true: $|a_P| < 1$ does not guarantee minimum phase. The 
roots $p_k$ can be computed by using any number of standard root-finding routines. 
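
As a quick check of (3.2.46) and (3.2.47), the poles of the model in Example 3.2.3 can be found with a standard root finder. This is a sketch assuming NumPy; the second polynomial is a made-up counterexample (not from the text) showing that $|a_P| < 1$ alone does not imply minimum phase.

```python
import numpy as np

# A(z) = 1 + a_1 z^-1 + a_2 z^-2 + a_3 z^-3 from Example 3.2.3.
a = np.array([1.0, -0.90625, 0.25, -0.15625])

# Multiplying A(z) by z^P gives a polynomial in z whose roots are the poles.
poles = np.roots(a)
sum_check = -np.sum(poles)           # should equal a_1, (3.2.46)
prod_check = np.prod(-poles)         # should equal a_P, (3.2.47)
is_min_phase = bool(np.all(np.abs(poles) < 1))
print(poles, sum_check.real, prod_check.real, is_min_phase)

# Hypothetical counterexample: (1 - 2 z^-1)(1 - 0.25 z^-1) has |a_P| = 0.5 < 1,
# yet one pole lies outside the unit circle.
b = np.array([1.0, -2.25, 0.5])
b_is_min_phase = bool(np.all(np.abs(np.roots(b)) < 1))
print(abs(b[-1]) < 1, b_is_min_phase)
```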


3.2.2 All-Pole Modeling and Linear Prediction 
Consider the AP($P$) model 

$$ x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) \tag{3.2.48} $$

Now recall from Chapter 1 that the $M$th-order linear predictor of $x(n)$ and the corresponding prediction error 
$e(n)$ are 

$$ \hat{x}(n) = -\sum_{k=1}^{M} a_k^{o}\, x(n-k) \tag{3.2.49} $$

$$ e(n) = x(n) - \hat{x}(n) = x(n) + \sum_{k=1}^{M} a_k^{o}\, x(n-k) \tag{3.2.50} $$

or 

$$ x(n) = -\sum_{k=1}^{M} a_k^{o}\, x(n-k) + e(n) \tag{3.2.51} $$

Notice that if the order of the linear predictor equals the order of the all-pole model ($M=P$) and if $a_k^{o} = a_k$, then the 
prediction error is equal to the excitation of the all-pole model, that is, $e(n) = w(n)$. Since all-pole modeling and 
FIR linear prediction are closely related, many properties and algorithms developed for one of them can be applied to 
the other. Linear prediction is extensively studied in Chapters 5 and 6. 
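
The identity $e(n) = w(n)$ can be seen directly by running the two difference equations back to back. The sketch below assumes NumPy, zero initial conditions, and the coefficients of Example 3.2.3; the excitation is an arbitrary white Gaussian sequence generated with a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(0)
a = [-0.90625, 0.25, -0.15625]       # a_1..a_P (Example 3.2.3)
P, N = len(a), 200

# All-pole synthesis, (3.2.48): x(n) = -sum_k a_k x(n-k) + w(n).
w = rng.standard_normal(N)
x = np.zeros(N)
for n in range(N):
    x[n] = w[n] - sum(a[k - 1] * x[n - k] for k in range(1, P + 1) if n - k >= 0)

# FIR prediction-error filter, (3.2.50), with M = P and the same coefficients.
e = np.zeros(N)
for n in range(N):
    e[n] = x[n] + sum(a[k - 1] * x[n - k] for k in range(1, P + 1) if n - k >= 0)

print(np.max(np.abs(e - w)))         # difference is at rounding level
```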


3.2.3 Autoregressive Models 


Causal all-pole models excited by white noise play a major role in practical applications and are known as 
autoregressive (AR) models. An AR($P$) model is defined by the difference equation 

$$ x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) \tag{3.2.52} $$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. An AR($P$) model is valid only if the corresponding AP($P$) system is stable. In this 
case, the output $x(n)$ is a stationary sequence with a mean value of zero. Postmultiplying (3.2.52) by $x^*(n-l)$ 
and taking the expectation, we obtain the following recursive relation for the autocorrelation: 

$$ r_x(l) = -\sum_{k=1}^{P} a_k r_x(l-k) + E\{w(n)\, x^*(n-l)\} \tag{3.2.53} $$

Similarly, using (3.1.1), we can show that $E\{w(n)\, x^*(n-l)\} = \sigma_w^2 h^*(-l)$. Thus, we have 

$$ r_x(l) = -\sum_{k=1}^{P} a_k r_x(l-k) + \sigma_w^2 h^*(-l) \qquad \text{for all } l \tag{3.2.54} $$

The variance of the output signal is 

$$ \sigma_x^2 = r_x(0) = -\sum_{k=1}^{P} a_k r_x^*(k) + \sigma_w^2 \qquad \text{or} \qquad \sigma_x^2 = \frac{\sigma_w^2}{1 + \sum_{k=1}^{P} a_k \rho_x^*(k)} \tag{3.2.55} $$

If we substitute $l = 0, 1, \ldots, P$ in (3.2.54) and recall that $h(n) = 0$ for $n < 0$, we obtain the following set of 
Yule-Walker equations: 

$$ \begin{bmatrix} r_x(0) & r_x^*(1) & \cdots & r_x^*(P) \\ r_x(1) & r_x(0) & \cdots & r_x^*(P-1) \\ \vdots & \vdots & \ddots & \vdots \\ r_x(P) & r_x(P-1) & \cdots & r_x(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} \sigma_w^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{3.2.56} $$

Careful inspection of the above equations reveals their similarity to the corresponding relationships developed 
previously for the AP($P$) model. This should be no surprise since the power spectrum of the white noise is flat. 
However, there is one important difference we should clarify: AP($P$) models were specified with a gain $d_0$ and the 
parameters $\{a_1, a_2, \ldots, a_P\}$, but for AR($P$) models we set the gain $d_0 = 1$ and define the model by the variance 
of the white excitation $\sigma_w^2$ and the parameters $\{a_1, a_2, \ldots, a_P\}$. In other words, we incorporate the gain of the 
model into the power of the input signal. Thus, the power spectrum of the output is $R_x(e^{j\omega}) = \sigma_w^2 |H(e^{j\omega})|^2$. Similar 
arguments apply to all parametric models driven by white noise. We have rederived some of the relationships to clarify 
these issues and to provide additional insight into the subject. 


3.2.4 Lower-Order Models 


In this section, we derive the properties of lower-order all-pole models, namely, first- and second-order models, with 
real coefficients. 


First-order all-pole model: AP(1) 


An AP(1) model has a transfer function 

$$ H(z) = \frac{d_0}{1 + a z^{-1}} \tag{3.2.57} $$

with a single pole at $z = -a$ on the real axis. It is clear that $H(z)$ is minimum-phase if 

$$ -1 < a < 1 \tag{3.2.58} $$

From (3.2.18) with $P=1$ and $l=1$, we have 

$$ a = -\frac{r(1)}{r(0)} = -\rho(1) \tag{3.2.59} $$

Similarly, from (3.2.44) with $m=1$, 

$$ a_1^{(1)} = a = -\rho(1) = k_1 \tag{3.2.60} $$

Since from (3.2.4), $h(0) = d_0$, and from (3.2.5) $h(n) = -a\,h(n-1)$ for $n > 0$, the impulse response of a 
single-pole filter is given by 

$$ h(n) = d_0 (-a)^n u(n) \tag{3.2.61} $$

The same result can, of course, be obtained by taking the inverse z-transform of $H(z)$. 
The autocorrelation is found in a similar fashion. From (3.2.18) and by using the fact that the autocorrelation is 
an even function, 

$$ r(l) = r(0)(-a)^{|l|} \qquad \text{for all } l \tag{3.2.62} $$

and from (3.2.20) 

$$ r(0) = \frac{d_0^2}{1 - a^2} = \frac{d_0^2}{1 - k_1^2} \tag{3.2.63} $$

Therefore, if the energy $r(0)$ in the impulse response is set to unity, then the gain must be set to 

$$ d_0 = \sqrt{1 - k_1^2} \qquad r(0) = 1 \tag{3.2.64} $$

The z-transform of the autocorrelation is then 

$$ R(z) = \frac{d_0^2}{(1 + a z^{-1})(1 + a z)} = r(0) \sum_{l=-\infty}^{\infty} (-a)^{|l|} z^{-l} \tag{3.2.65} $$

and the spectrum is 

$$ R(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{d_0^2}{|1 + a e^{-j\omega}|^2} = \frac{d_0^2}{1 + 2a\cos\omega + a^2} \tag{3.2.66} $$

Figures 3.4 and 3.5 show a typical realization of the output, the impulse response, autocorrelation, and spectrum 
of two AP(1) models. The sample process realizations were obtained by driving the model with white Gaussian noise 
of zero mean and unit variance. When the positive pole ($p = -a = 0.8$) is close to the unit circle, successive samples 
of the output process are similar, as dictated by the slowly decaying autocorrelation and the corresponding low-pass 
spectrum. In contrast, a negative pole close to the unit circle results in a rapidly oscillating sequence. This is clearly 
reflected in the alternating sign of the autocorrelation sequence and the associated high-pass spectrum. 

FIGURE 3.4 
Sample realization of the output process, impulse response, autocorrelation, and spectrum of an AP(1) model with $a = -0.8$. 

Note that a positive real pole acts as a low-pass filter, while a negative real pole has the spectral 
characteristics of a high-pass filter. (This situation in the digital domain contrasts with that in the corresponding 
analog domain where a real-axis pole can only have low-pass characteristics.) The discrete-time negative real pole 
can be thought of as one-half of two conjugate poles at half the sampling frequency. Notice that both spectra are even 
and have zero slope at $\omega = 0$ and $\omega = \pi$. These propositions are true of the spectra of all parametric models (i.e., 
pole-zero models) with real coefficients (see Problem 3.13). 

Consider now the real-valued AR(1) process $x(n)$ generated by 

$$ x(n) = -a\,x(n-1) + w(n) \tag{3.2.67} $$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. Using the formula $R_x(z) = \sigma_w^2 H(z) H^*(1/z^*)$ and previous results, we can see that the 
autocorrelation and the PSD of $x(n)$ are given by 

$$ r_x(l) = \frac{\sigma_w^2}{1 - a^2} (-a)^{|l|} \qquad \text{and} \qquad R_x(e^{j\omega}) = \frac{\sigma_w^2}{1 + a^2 + 2a\cos\omega} $$

respectively. Since $\sigma_x^2 = r_x(0) = \sigma_w^2/(1 - a^2)$, the SFM of $x(n)$ is (see Section 3.1.18) 

$$ \mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = 1 - a^2 \tag{3.2.68} $$

FIGURE 3.5 
Sample realization of the output process, impulse response, autocorrelation, and spectrum of an AP(1) model with $a = 0.8$. 

Clearly, if $a = 0$, then from (3.2.67), $x(n)$ is a white noise process and from (3.2.68), $\mathrm{SFM}_x = 1$. If $a \to 1$, then 
$\mathrm{SFM}_x \to 0$; and in the limit when $a = 1$, the process becomes a random walk process, which is a nonstationary 
process with linearly increasing variance $E\{x^2(n)\} = n\sigma_w^2$. The correlation matrix is Toeplitz, and it is a rare 
exception in which eigenvalues and eigenvectors can be described by analytical expressions (Jayant and Noll 1984). 
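
The AR(1) expressions above are easy to confirm numerically. The following sketch assumes NumPy and takes the spectral flatness measure to be the ratio of the geometric mean to the arithmetic mean of the PSD, which is an assumption about the definition referenced in Section 3.1.18; the values $a = -0.8$ and $\sigma_w^2 = 1$ are illustrative.

```python
import numpy as np

a, sigma_w2 = -0.8, 1.0

# Closed-form autocorrelation for lags 0..5.
lags = np.arange(0, 6)
r_closed = sigma_w2 / (1 - a**2) * (-a) ** lags

# Numerical autocorrelation from a (truncated) impulse response h(n) = (-a)^n.
n = np.arange(400)
h = (-a) ** n
r_numeric = np.array([sigma_w2 * np.dot(h[: len(h) - l], h[l:]) for l in lags])

# PSD on a uniform frequency grid and the resulting flatness measure.
omega = 2 * np.pi * np.arange(4096) / 4096
psd = sigma_w2 / (1 + a**2 + 2 * a * np.cos(omega))
sfm = np.exp(np.mean(np.log(psd))) / np.mean(psd)   # geometric / arithmetic mean

print(r_closed, r_numeric)
print(sfm, 1 - a**2)
```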


Second-order all-pole model: AP(2) 


The system function of an AP(2) model is given by 

$$ H(z) = \frac{d_0}{1 + a_1 z^{-1} + a_2 z^{-2}} = \frac{d_0}{(1 - p_1 z^{-1})(1 - p_2 z^{-1})} \tag{3.2.69} $$

From (3.2.46) and (3.2.47), we have 

$$ a_1 = -(p_1 + p_2) \qquad a_2 = p_1 p_2 \tag{3.2.70} $$

Recall that $H(z)$ is minimum-phase if the two poles $p_1$ and $p_2$ are inside the unit circle. Under these conditions, 
$a_1$ and $a_2$ lie in a triangular region defined by 

$$ -1 < a_2 < 1 \qquad a_2 - a_1 > -1 \qquad a_2 + a_1 > -1 \tag{3.2.71} $$

and shown in Figure 3.6. The first condition follows from (3.2.70) since $|p_1| < 1$ and $|p_2| < 1$. The last two 
conditions can be derived by assuming real roots and setting the larger root to less than 1 and the smaller root to 
greater than $-1$. By adding the last two conditions, we obtain the redundant condition $a_2 > -1$. 
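
The triangle (3.2.71) can be compared against a direct root test. This is a sketch assuming NumPy; `in_triangle` and `poles_inside` are hypothetical helper names, and the random sample of the $(a_1, a_2)$ plane is only a spot check, not a proof.

```python
import numpy as np

def in_triangle(a1, a2):
    """Conditions (3.2.71) on the AP(2) coefficients."""
    return bool((-1 < a2 < 1) and (a2 - a1 > -1) and (a2 + a1 > -1))

def poles_inside(a1, a2):
    """Direct test: both roots of z^2 + a1 z + a2 inside the unit circle."""
    return bool(np.all(np.abs(np.roots([1.0, a1, a2])) < 1))

# Spot check on random points of the parameter plane (fixed seed).
rng = np.random.default_rng(1)
samples = rng.uniform(-2.5, 2.5, size=(2000, 2))
agree = all(in_triangle(a1, a2) == poles_inside(a1, a2) for a1, a2 in samples)
print(agree)
print(in_triangle(-0.4944, 0.64), in_triangle(-2.25, 0.5))
```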
Complex roots occur in the region 

$$ \frac{a_1^2}{4} < a_2 \le 1 \qquad \text{complex poles} \tag{3.2.72} $$

with $a_2 = 1$ resulting in both roots being on the unit circle. Note that, in order to have complex poles, $a_2$ cannot be 
negative. If the complex poles are written in polar form 

$$ p_i = r e^{\pm j\theta} \qquad 0 \le r \le 1 \tag{3.2.73} $$

then 

$$ a_1 = -2r\cos\theta \qquad a_2 = r^2 \tag{3.2.74} $$

and 

$$ H(z) = \frac{d_0}{1 - (2r\cos\theta) z^{-1} + r^2 z^{-2}} \qquad \text{complex poles} \tag{3.2.75} $$

Here, $r$ is the radius (magnitude) of the poles, and $\theta$ is the angle or normalized frequency of the poles. 

FIGURE 3.6 
Minimum-phase region (triangle) for the AP(2) model in the $(a_1, a_2)$ parameter space. Region labels in the figure: complex conjugate poles; real and equal poles. 
Impulse response. The impulse response of an AP(2) model can be written in terms of its two poles by 
evaluating the inverse z-transform of (3.2.69). The result is 

$$ h(n) = \frac{d_0}{p_1 - p_2} \left( p_1^{n+1} - p_2^{n+1} \right) u(n) \tag{3.2.76} $$

for $p_1 \neq p_2$. Otherwise, for $p_1 = p_2 = p$, 

$$ h(n) = d_0 (n+1)\, p^n u(n) \tag{3.2.77} $$

In the special case of a complex conjugate pair of poles $p_1 = re^{j\theta}$ and $p_2 = re^{-j\theta}$, Equation (3.2.76) reduces to 

$$ h(n) = d_0\, r^n\, \frac{\sin[(n+1)\theta]}{\sin\theta}\, u(n) \qquad \text{complex poles} \tag{3.2.78} $$

Since $0 < r < 1$, $h(n)$ is a damped sinusoid of frequency $\theta$. 
Autocorrelation. The autocorrelation can also be written in terms of the two poles as 

$$ r(l) = \frac{d_0^2}{(p_1 - p_2)(1 - p_1 p_2)} \left( \frac{p_1^{l+1}}{1 - p_1^2} - \frac{p_2^{l+1}}{1 - p_2^2} \right) \qquad l \ge 0 \tag{3.2.79} $$

from which we can deduce the energy 

$$ r(0) = \frac{d_0^2 (1 + p_1 p_2)}{(1 - p_1 p_2)(1 - p_1^2)(1 - p_2^2)} \tag{3.2.80} $$

For the special case of a complex conjugate pole pair, (3.2.79) can be rewritten as 

$$ r(l) = \frac{d_0^2\, r^l \{\sin[(l+1)\theta] - r^2 \sin[(l-1)\theta]\}}{[(1 - r^2)\sin\theta]\,(1 - 2r^2\cos 2\theta + r^4)} \qquad l \ge 0 \tag{3.2.81} $$

Then from (3.2.80) we can write an expression for the energy in terms of the polar coordinates of the complex 
conjugate pole pair 

$$ r(0) = \frac{d_0^2 (1 + r^2)}{(1 - r^2)(1 - 2r^2\cos 2\theta + r^4)} \tag{3.2.82} $$

The normalized autocorrelation is given by 

$$ \rho(l) = \frac{r^l \{\sin[(l+1)\theta] - r^2 \sin[(l-1)\theta]\}}{(1 + r^2)\sin\theta} \qquad l \ge 0 \tag{3.2.83} $$

which can be rewritten as 

$$ \rho(l) = \frac{1}{\cos\beta}\, r^l \cos(l\theta - \beta) \qquad l \ge 0 \tag{3.2.84} $$

where 

$$ \tan\beta = \frac{(1 - r^2)\cos\theta}{(1 + r^2)\sin\theta} \tag{3.2.85} $$

Therefore, $\rho(l)$ is a damped cosine wave with its maximum amplitude at the origin. 


Spectrum. By setting the two poles equal to 

$$ p_1 = r_1 e^{j\theta_1} \qquad p_2 = r_2 e^{j\theta_2} \tag{3.2.86} $$

the spectrum of an AP(2) model can be written as 

$$ R(e^{j\omega}) = \frac{d_0^2}{[1 - 2r_1\cos(\omega - \theta_1) + r_1^2]\,[1 - 2r_2\cos(\omega - \theta_2) + r_2^2]} \tag{3.2.87} $$

There are four cases of interest, which depend on the location of the poles in the complex plane: 

    Pole locations                      Type of R(e^{jω}) 
    p_1 > 0, p_2 > 0                    Low-pass 
    p_1 < 0, p_2 < 0                    High-pass 
    p_1 > 0, p_2 < 0                    Stopband 
    p_1, p_2 complex conjugates         Bandpass 

We concentrate on the fourth case of complex conjugate poles, which is of greatest interest. The other three 
cases are explored in Problem 3.15. The spectrum is given by 

$$ R(e^{j\omega}) = \frac{d_0^2}{[1 - 2r\cos(\omega - \theta) + r^2]\,[1 - 2r\cos(\omega + \theta) + r^2]} \tag{3.2.88} $$

The peak of this spectrum can be shown to be located at a frequency $\omega_c$, given by 

$$ \cos\omega_c = \frac{1 + r^2}{2r}\cos\theta \tag{3.2.89} $$

Since $1 + r^2 > 2r$ for $r < 1$, we have 

$$ |\cos\omega_c| > |\cos\theta| \tag{3.2.90} $$

that is, the spectral peak is lower than the pole frequency for $0 < \theta < \pi/2$ and higher than the pole frequency for 
$\pi/2 < \theta < \pi$. 

FIGURE 3.7 
Sample realization of the output process, impulse response, autocorrelation, and spectrum of an AP(2) model with complex 
conjugate poles. 

This behavior is illustrated in Figure 3.7 for an AP(2) model with $a_1 = -0.4944$, $a_2 = 0.64$, and $d_0 = 1$. The 
model has two complex conjugate poles with $r = 0.8$ and $\theta = \pm 2\pi/5$. The spectrum has a single peak and 
displays a bandpass type of behavior. The impulse response is a damped sine wave while the autocorrelation is a 
damped cosine. The typical realization of the output shows clearly a pseudoperiodic behavior that is explained by the 
shape of the autocorrelation and the spectrum of the model. We also notice that if the poles are complex conjugates, 
the autocorrelation has pseudoperiodic behavior. 
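
For the model of Figure 3.7, the peak location predicted by (3.2.89) can be compared with a brute-force search over a dense frequency grid. This is a sketch assuming NumPy; the grid size is an arbitrary choice.

```python
import numpy as np

r, theta, d0 = 0.8, 2 * np.pi / 5, 1.0
a1, a2 = -2 * r * np.cos(theta), r**2        # (3.2.74): a1 ~ -0.4944, a2 = 0.64

# Spectrum d0^2 / |A(e^{jw})|^2 on a dense grid over [0, pi].
omega = np.linspace(0, np.pi, 200001)
A = 1 + a1 * np.exp(-1j * omega) + a2 * np.exp(-2j * omega)
spectrum = d0**2 / np.abs(A) ** 2
omega_peak = omega[np.argmax(spectrum)]

# Predicted peak from (3.2.89).
omega_c = np.arccos((1 + r**2) / (2 * r) * np.cos(theta))

print(omega_peak, omega_c, theta)            # the peak sits slightly below theta
```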


Equivalent model descriptions. We now write explicit formulas for $a_1$ and $a_2$ in terms of the lattice 
parameters $k_1$ and $k_2$ and the autocorrelation coefficients. From the step-up and step-down recursions, we have 

$$ a_1 = k_1(1 + k_2) \qquad a_2 = k_2 \tag{3.2.91} $$

and the inverse relations 

$$ k_1 = \frac{a_1}{1 + a_2} \qquad k_2 = a_2 \tag{3.2.92} $$

From the Yule-Walker equations (3.2.18), we can write the two equations 

$$ a_1 r(0) + a_2 r(1) = -r(1) \qquad a_1 r(1) + a_2 r(0) = -r(2) \tag{3.2.93} $$

which can be solved for $a_1$ and $a_2$ in terms of $\rho(1)$ and $\rho(2)$ 

$$ a_1 = -\rho(1)\,\frac{1 - \rho(2)}{1 - \rho^2(1)} \qquad a_2 = \frac{\rho^2(1) - \rho(2)}{1 - \rho^2(1)} \tag{3.2.94} $$

or for $\rho(1)$ and $\rho(2)$ in terms of $a_1$ and $a_2$ 

$$ \rho(1) = -\frac{a_1}{1 + a_2} \qquad \rho(2) = -a_1\rho(1) - a_2 = \frac{a_1^2}{1 + a_2} - a_2 \tag{3.2.95} $$

From the equations above, we can also write the relation and inverse relation between the coefficients $k_1$ and $k_2$ 
and the normalized autocorrelations $\rho(1)$ and $\rho(2)$ as 

$$ k_1 = -\rho(1) \qquad k_2 = \frac{\rho^2(1) - \rho(2)}{1 - \rho^2(1)} \tag{3.2.96} $$

and 

$$ \rho(1) = -k_1 \qquad \rho(2) = k_1^2(1 + k_2) - k_2 \tag{3.2.97} $$


The gain $d_0$ can also be written in terms of the other coefficients. From (3.2.20), we have 

$$ d_0^2 = r(0)[1 + a_1\rho(1) + a_2\rho(2)] \tag{3.2.98} $$

which can be shown to be equal to 

$$ d_0^2 = r(0)(1 - k_1^2)(1 - k_2^2) \tag{3.2.99} $$

Minimum-phase conditions. In (3.2.71), we have a set of conditions on $a_1$ and $a_2$ so that the AP(2) model is 
minimum-phase, and Figure 3.6 shows the corresponding admissible region for minimum-phase models. Similar 
relations and regions can be derived for the other types of parameters, as we will show below. In terms of $k_1$ and 
$k_2$, the AP(2) model is minimum-phase if 

$$ |k_1| < 1 \qquad |k_2| < 1 \tag{3.2.100} $$

This region is depicted in Figure 3.8(a). Shown also is the region that results in complex roots, which is specified by 

$$ 0 < k_2 < 1 \tag{3.2.101} $$

$$ k_1^2 < \frac{4k_2}{(1 + k_2)^2} \tag{3.2.102} $$

Because of the correlation matching property of all-pole models, we can find a minimum-phase all-pole model for 
every positive definite sequence of autocorrelation values. Therefore, the admissible region of autocorrelation values 
coincides with the positive definite region. The positive definite condition is equivalent to requiring that all the leading 
principal minors (determinants) of the autocorrelation matrix in (3.2.30) be positive. For $P = 2$, there are two conditions: 

$$ \det\begin{bmatrix} 1 & \rho(1) \\ \rho(1) & 1 \end{bmatrix} > 0 \qquad \det\begin{bmatrix} 1 & \rho(1) & \rho(2) \\ \rho(1) & 1 & \rho(1) \\ \rho(2) & \rho(1) & 1 \end{bmatrix} > 0 \tag{3.2.103} $$

These two conditions reduce to 

$$ |\rho(1)| < 1 \tag{3.2.104} $$

$$ 2\rho^2(1) - 1 < \rho(2) < 1 \tag{3.2.105} $$

which determine the admissible region shown in Figure 3.8(b). Conditions (3.2.104) and (3.2.105) can also be derived from (3.2.71) 
and (3.2.95). The first condition, (3.2.104), is equivalent to 

$$ \left| \frac{a_1}{1 + a_2} \right| < 1 \tag{3.2.106} $$

which can be shown to be equivalent to the last two conditions in (3.2.71). 

It is important to note that the region in Figure 3.8(b) is the admissible region for any positive definite 
autocorrelation, including the autocorrelation of mixed-phase signals. This is reasonable since the autocorrelation 
does not contain phase information and allows the signal to have minimum- and maximum-phase components. What 
we are claiming here, however, is that for every autocorrelation sequence in the positive definite region, we can find a 
minimum-phase all-pole model with the same autocorrelation values. Therefore, for this problem, the positive 
definite region is identical to the admissible minimum-phase region. 

FIGURE 3.8 
Minimum-phase and positive definiteness regions for the AP(2) model in the (a) $(k_1, k_2)$ space and (b) $(\rho(1), \rho(2))$ space. 
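
The AP(2) conversions (3.2.91) through (3.2.99) and the admissibility conditions (3.2.100), (3.2.104), and (3.2.105) can be exercised on the model of Figure 3.7. The sketch below assumes NumPy and real coefficients; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def k_from_a(a1, a2):            # (3.2.92)
    return a1 / (1 + a2), a2

def a_from_k(k1, k2):            # (3.2.91)
    return k1 * (1 + k2), k2

def rho_from_a(a1, a2):          # (3.2.95)
    rho1 = -a1 / (1 + a2)
    return rho1, -a1 * rho1 - a2

def a_from_rho(rho1, rho2):      # (3.2.94)
    den = 1 - rho1**2
    return -rho1 * (1 - rho2) / den, (rho1**2 - rho2) / den

# AP(2) model with r = 0.8 and theta = 2*pi/5.
a1, a2 = -2 * 0.8 * np.cos(2 * np.pi / 5), 0.64
k1, k2 = k_from_a(a1, a2)
rho1, rho2 = rho_from_a(a1, a2)

# d0^2 / r(0) computed two ways: (3.2.98) and (3.2.99).
gain_ratio_a = 1 + a1 * rho1 + a2 * rho2
gain_ratio_k = (1 - k1**2) * (1 - k2**2)

# Admissibility in the lattice and autocorrelation domains.
min_phase_k = abs(k1) < 1 and abs(k2) < 1
pos_def = abs(rho1) < 1 and (2 * rho1**2 - 1 < rho2 < 1)

print(k1, k2, rho1, rho2)
print(gain_ratio_a, gain_ratio_k, min_phase_k, pos_def)
```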




3.3 All-Zero Models 


In this section, we investigate the properties of the all-zero model. The output of the all-zero model is the weighted 
average of delayed versions of the input signal 

$$ x(n) = \sum_{k=0}^{Q} d_k w(n-k) \tag{3.3.1} $$

where $Q$ is the order of the model. The system function is 

$$ H(z) = D(z) = \sum_{k=0}^{Q} d_k z^{-k} \tag{3.3.2} $$

The all-zero model can be implemented by using either a direct or a lattice structure. The conversion between the two 
sets of parameters can be done by using the step-up and step-down recursions described in Chapter 6 and setting 
$A(z) = D(z)$. Notice that the same set of parameters can be used to implement either an all-zero or an all-pole model 
by using a different structure. 


3.3.1 Model Properties 


We next provide a brief discussion of the properties of the all-zero model. 


Impulse response. It can be easily seen that the AZ($Q$) model is an FIR system with an impulse response 

$$ h(n) = \begin{cases} d_n & 0 \le n \le Q \\ 0 & \text{elsewhere} \end{cases} \tag{3.3.3} $$

Autocorrelation. The autocorrelation of the impulse response is given by 

$$ r_h(l) = \sum_{n=-\infty}^{\infty} h(n) h^*(n-l) = \begin{cases} \sum_{k=0}^{Q-l} d_{k+l}\, d_k^* & 0 \le l \le Q \\ 0 & l > Q \end{cases} \tag{3.3.4} $$

and 

$$ r_h(-l) = r_h^*(l) \qquad \text{all } l \tag{3.3.5} $$

We usually set $d_0 = 1$, which implies that 

$$ r_h(l) = d_l + d_{l+1} d_1^* + \cdots + d_Q d_{Q-l}^* \qquad l = 0, 1, \ldots, Q \tag{3.3.6} $$

hence, the normalized autocorrelation is 

$$ \rho_h(l) = \begin{cases} \dfrac{d_l + d_{l+1} d_1^* + \cdots + d_Q d_{Q-l}^*}{1 + |d_1|^2 + \cdots + |d_Q|^2} & l = 1, 2, \ldots, Q \\ 0 & l > Q \end{cases} \tag{3.3.7} $$

We see that the autocorrelation of an AZ($Q$) model is zero for lags $|l|$ exceeding the order $Q$ of the model. If 
$\rho_h(1), \rho_h(2), \ldots, \rho_h(Q)$ are known, then the $Q$ equations (3.3.7) can be solved for model parameters 
$d_1, d_2, \ldots, d_Q$. However, unlike the Yule-Walker equations for the AP($P$) model, which are linear, Equations 
(3.3.7) are nonlinear and their solution is quite complicated. 


Spectrum. The spectrum of the AZ($Q$) model is given by 

$$ R_h(e^{j\omega}) = D(z) D^*(1/z^*) \big|_{z = e^{j\omega}} = |D(e^{j\omega})|^2 = \sum_{l=-Q}^{Q} r_h(l)\, e^{-j\omega l} \tag{3.3.8} $$

which is basically a trigonometric polynomial. 
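
The finite support of the all-zero autocorrelation and the equivalence of the two forms in (3.3.8) can be verified for a small example. This is a sketch assuming NumPy; the coefficient vector `d` is an arbitrary AZ(2) example, not one taken from the text.

```python
import numpy as np

d = np.array([1.0, 0.5, -0.3])            # hypothetical AZ(2) model, d_0 = 1
Q = len(d) - 1

# Autocorrelation for lags 0..Q from (3.3.4); it is zero for l > Q.
r_h = np.array([sum(d[k + l] * np.conj(d[k]) for k in range(Q - l + 1))
                for l in range(Q + 1)])
rho_h = r_h / r_h[0]                      # normalized autocorrelation, (3.3.7)

# Spectrum two ways: |D(e^{jw})|^2 and the trigonometric polynomial in (3.3.8).
omega = np.linspace(0, np.pi, 64)
D = sum(d[k] * np.exp(-1j * omega * k) for k in range(Q + 1))
lags = np.arange(-Q, Q + 1)
r_two_sided = np.concatenate((np.conj(r_h[:0:-1]), r_h))   # r_h(-l) = r_h*(l)
spectrum_from_r = np.real(sum(r_two_sided[i] * np.exp(-1j * omega * lags[i])
                              for i in range(2 * Q + 1)))

print(r_h, rho_h)
print(np.max(np.abs(spectrum_from_r - np.abs(D) ** 2)))
```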


Impulse train excitations. The response $\tilde{h}(n)$ of the AZ($Q$) model to a periodic impulse train with period $L$ 
is periodic with the same period, and its spectrum is a sampled version of (3.3.8) at multiples of $2\pi/L$. Therefore, to 
recover the autocorrelation $r_h(l)$ and the spectrum $R_h(e^{j\omega})$ from the autocorrelation or spectrum of $\tilde{h}(n)$, we 
should have $L \ge 2Q+1$ in order to avoid aliasing in the autocorrelation lag domain. Also, if $L > Q$, the impulse 
response $h(n)$, $0 \le n \le Q$, can be recovered from the response $\tilde{h}(n)$ (no time-domain aliasing) (see Problem 
3.24). 

Partial autocorrelation and lattice-ladder structures. The PACS of an AZ($Q$) model is computed by fitting a 
series of AP($P$) models for $P = 1, 2, \ldots$, to the autocorrelation sequence (3.3.7) of the AZ($Q$) model. Since the 
AZ($Q$) model is equivalent to an AP($\infty$) model, the PACS of an all-zero model has infinite extent and behaves as 
the autocorrelation sequence of an all-pole model. This is illustrated later for the low-order AZ(1) and AZ(2) models. 


3.3.2 Moving-Average Models 
A moving-average model is an AZ($Q$) model with $d_0 = 1$ driven by white noise, that is, 

$$ x(n) = w(n) + \sum_{k=1}^{Q} d_k w(n-k) \tag{3.3.9} $$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. The output $x(n)$ has zero mean and variance of 

$$ \sigma_x^2 = \sigma_w^2 \sum_{k=0}^{Q} |d_k|^2 \tag{3.3.10} $$

The autocorrelation and power spectrum are given by $r_x(l) = \sigma_w^2 r_h(l)$ and $R_x(e^{j\omega}) = \sigma_w^2 |D(e^{j\omega})|^2$, respectively. 
Clearly, observations that are more than $Q$ samples apart are uncorrelated because the autocorrelation is zero after lag $Q$. 


3.3.3 Lower-Order Models 


To familiarize ourselves with all-zero models, we next investigate in detail the properties of the AZ(1) and AZ(2) 
models with real coefficients. 


The first-order all-zero model: AZ(1). For generality, we consider an AZ(1) model whose system function is 

$$ H(z) = G(1 + d_1 z^{-1}) \tag{3.3.11} $$

The model is stable for any value of $d_1$ and minimum-phase for $-1 < d_1 < 1$. The autocorrelation is the inverse 
z-transform of 

$$ R_h(z) = H(z)H(z^{-1}) = G^2[d_1 z + (1 + d_1^2) + d_1 z^{-1}] \tag{3.3.12} $$

Hence, $r_h(0) = G^2(1 + d_1^2)$, $r_h(1) = r_h(-1) = G^2 d_1$, and $r_h(l) = 0$ elsewhere. Therefore, the normalized autocorrelation 
is 

$$ \rho_h(l) = \begin{cases} 1 & l = 0 \\ \dfrac{d_1}{1 + d_1^2} & l = \pm 1 \\ 0 & |l| \ge 2 \end{cases} \tag{3.3.13} $$

The condition $-1 < d_1 < 1$ implies that $|\rho_h(1)| \le 1/2$ for a minimum-phase model. From $\rho_h(1) = d_1/(1 + d_1^2)$, we 
obtain the quadratic equation 

$$ \rho_h(1)\, d_1^2 - d_1 + \rho_h(1) = 0 \tag{3.3.14} $$

which has the following two roots: 

$$ d_1 = \frac{1 \pm \sqrt{1 - 4\rho_h^2(1)}}{2\rho_h(1)} \tag{3.3.15} $$

Since the product of the roots is 1, if $d_1$ is a root, then $1/d_1$ must also be a root. Hence, only one of these two roots 
can satisfy the minimum-phase condition $-1 < d_1 < 1$. 
The spectrum is obtained by setting $z = e^{j\omega}$ in (3.3.12), or from (3.3.8) 

$$ R_h(e^{j\omega}) = G^2(1 + d_1^2 + 2d_1\cos\omega) \tag{3.3.16} $$

The autocorrelation is positive definite if $R_h(e^{j\omega}) > 0$, which holds for all values of $d_1$ except $|d_1| = 1$, where the 
spectrum touches zero. Note that if $d_1 > 0$, then $\rho_h(1) > 0$ and the spectrum has low-pass behavior (see Figure 3.9), 
whereas a high-pass spectrum is obtained when $d_1 < 0$ (see Figure 3.10). 

FIGURE 3.9 
Sample realization of the output process, ACS, PACS, and spectrum of an AZ(1) model with $d_1 = 0.95$. 

FIGURE 3.10 
Sample realization of the output process, ACS, PACS, and spectrum of an AZ(1) model with $d_1 = -0.95$. 
The first lattice parameter of the AZ(1) model is $k_1 = d_1$. The PACS can be obtained from the Yule-Walker 
equations by using the autocorrelation sequence (3.3.13). Indeed, after some algebra we obtain 

$$ k_m = \frac{(-d_1)^m (1 - d_1^2)}{1 - d_1^{2(m+1)}} \qquad m = 1, 2, \ldots, \infty \tag{3.3.17} $$

(see Problem 3.25). Notice the duality between the ACS and PACS of AP(1) and AZ(1) models. 
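
The closed form (3.3.17) can be compared with the PACS obtained by fitting all-pole models of increasing order to the autocorrelation (3.3.13). The sketch below assumes NumPy and real coefficients; `levinson_pacs` is a hypothetical helper implementing the Levinson-Durbin recursion of Chapter 6, and $d_1 = 0.5$ is an illustrative value.

```python
import numpy as np

def levinson_pacs(rho, order):
    """PACS k_1..k_order of a real normalized autocorrelation rho(0..order)."""
    a = np.array([1.0])
    err = float(rho[0])
    k = np.zeros(order)
    for m in range(1, order + 1):
        acc = rho[m] + np.dot(a[1:], rho[m - 1:0:-1])
        k[m - 1] = -acc / err
        a_prev = np.concatenate((a, [0.0]))
        a = a_prev + k[m - 1] * a_prev[::-1]
        err *= 1.0 - k[m - 1] ** 2
    return k

d1, M = 0.5, 8

# Normalized autocorrelation of the AZ(1) model, (3.3.13).
rho = np.zeros(M + 1)
rho[0] = 1.0
rho[1] = d1 / (1 + d1**2)

k_numeric = levinson_pacs(rho, M)

# Closed form (3.3.17).
m = np.arange(1, M + 1)
k_formula = (-d1) ** m * (1 - d1**2) / (1 - d1 ** (2 * (m + 1)))

print(np.round(k_numeric, 6))
print(np.round(k_formula, 6))
```

The PACS decays geometrically and never terminates, which mirrors the autocorrelation of an AP(1) model.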
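Consider now the MA(1) real-valued process $x(n)$ generated by 

$$ x(n) = w(n) + b\,w(n-1) $$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. Using $R_x(z) = \sigma_w^2 H(z) H^*(1/z^*)$, we obtain the PSD function 

$$ R_x(e^{j\omega}) = \sigma_w^2(1 + b^2 + 2b\cos\omega) $$

which has low-pass (high-pass) characteristics if $0 < b < 1$ ($-1 < b < 0$). Since $\sigma_x^2 = r_x(0) = \sigma_w^2(1 + b^2)$, we have 
(see Section 3.1.18) 

$$ \mathrm{SFM}_x = \frac{\sigma_w^2}{\sigma_x^2} = \frac{1}{1 + b^2} \tag{3.3.18} $$

which is maximum for $b = 0$ (white noise). The correlation matrix is banded Toeplitz (only a number of diagonals 
close to the main diagonal are nonzero) 

$$ R_x = \sigma_w^2(1 + b^2) \begin{bmatrix} 1 & \frac{b}{1+b^2} & 0 & \cdots & 0 \\ \frac{b}{1+b^2} & 1 & \frac{b}{1+b^2} & \cdots & 0 \\ 0 & \frac{b}{1+b^2} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \tag{3.3.19} $$

and its eigenvalues and eigenvectors are given by $\lambda_k = R_x(e^{j\omega_k})$, $q_n^{(k)} = \sin\omega_k n$, $\omega_k = \pi k/(M+1)$, where 
$k = 1, 2, \ldots, M$ (see Problem 3.30). 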
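
The analytical eigenstructure of the MA(1) correlation matrix can be checked against a numerical eigendecomposition. This is a sketch assuming NumPy; the values of $b$, $\sigma_w^2$, and the matrix size $M$ are illustrative.

```python
import numpy as np

b, sigma_w2, M = 0.6, 1.0, 8

# M x M banded Toeplitz correlation matrix: r_x(0) on the diagonal, r_x(1) next to it.
R = sigma_w2 * ((1 + b**2) * np.eye(M) + b * (np.eye(M, k=1) + np.eye(M, k=-1)))

# Analytical eigenvalues: the PSD sampled at omega_k = pi*k/(M+1).
k = np.arange(1, M + 1)
omega_k = np.pi * k / (M + 1)
lam_formula = sigma_w2 * (1 + b**2 + 2 * b * np.cos(omega_k))

# Numerical eigenvalues (returned in ascending order).
lam_numeric = np.linalg.eigvalsh(R)

# Analytical eigenvectors: column k holds q_n = sin(omega_k * n), n = 1..M.
n = np.arange(1, M + 1)
Qm = np.sin(np.outer(n, omega_k))
residual = R @ Qm - Qm * lam_formula      # R q = lambda q for every column

print(np.sort(lam_formula))
print(lam_numeric)
print(np.max(np.abs(residual)))
```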


The second-order all-zero model: AZ(2). Now let us consider the second-order all-zero model. The system 
function of the AZ(2) model is 

$$ H(z) = G(1 + d_1 z^{-1} + d_2 z^{-2}) \tag{3.3.20} $$

The system is stable for all values of $d_1$ and $d_2$, and minimum-phase [see the discussion for the AP(2) model] if 

$$ -1 < d_2 < 1 \qquad d_2 - d_1 > -1 \qquad d_2 + d_1 > -1 \tag{3.3.21} $$

which is a triangular region identical to that shown in Figure 3.6. The normalized autocorrelation and the spectrum are 

$$ \rho_h(l) = \begin{cases} 1 & l = 0 \\ \dfrac{d_1(1 + d_2)}{1 + d_1^2 + d_2^2} & l = \pm 1 \\ \dfrac{d_2}{1 + d_1^2 + d_2^2} & l = \pm 2 \\ 0 & |l| \ge 3 \end{cases} \tag{3.3.22} $$

and 

$$ R_h(e^{j\omega}) = G^2[(1 + d_1^2 + d_2^2) + 2d_1(1 + d_2)\cos\omega + 2d_2\cos 2\omega] \tag{3.3.23} $$

respectively. 

FIGURE 3.11 
Minimum-phase region in the autocorrelation domain for the AZ(2) model; horizontal axis $\rho(1)$, vertical axis $\rho(2)$. 

The minimum-phase region in the autocorrelation domain is shown in Figure 3.11 and is described by the 
equations 

$$ \rho(2) + \rho(1) = -0.5 \qquad \rho(2) - \rho(1) = -0.5 \qquad \rho^2(1) = 4\rho(2)[1 - 2\rho(2)] \tag{3.3.24} $$

derived in Problem 3.26. The formula for the PACS is quite involved. The important thing is the duality between the 
ACS and the PACS of AZ(2) and AP(2) models (see Problem 3.27). 


3.4 Pole-Zero Models 


We will focus on causal pole-zero models with a recursive input-output relationship given by 

$$ x(n) = -\sum_{k=1}^{P} a_k x(n-k) + \sum_{k=0}^{Q} d_k w(n-k) \tag{3.4.1} $$


where we assume that $P > 0$ and $Q \ge 1$. The models can be implemented using either direct-form or lattice-ladder 
structures (Proakis and Manolakis 1996). 


3.4.1 Model Properties 


In this section, we present some of the basic properties of pole-zero models. 
Impulse response. The impulse response of a causal pole-zero model can be written in recursive form from 
(3.4.1) as 

$$ h(n) = -\sum_{k=1}^{P} a_k h(n-k) + d_n \qquad n \ge 0 \tag{3.4.2} $$

where $d_n = 0$ for $n > Q$ and $h(n) = 0$ for $n < 0$. Clearly, this formula is useful if the model is stable. From (3.4.2), it is clear that 

$$ h(n) = -\sum_{k=1}^{P} a_k h(n-k) \qquad n > Q \tag{3.4.3} $$

so that the impulse response obeys the linear prediction equation for $n > Q$. Thus if we are given $h(n)$, $0 \le n \le P+Q$, 
we can compute $\{a_k\}$ from (3.4.3) by using the $P$ equations specified by $Q+1 \le n \le Q+P$. Then we can compute $\{d_k\}$ 
from (3.4.2), using $0 \le n \le Q$. Therefore, the first $P+Q+1$ values of the impulse response completely specify the 
pole-zero model. 
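
The two-step recovery described above can be sketched as follows, assuming NumPy and a hypothetical PZ(2, 1) model chosen only for illustration. The first step solves the $P$ linear equations from (3.4.3) for $\{a_k\}$; the second evaluates (3.4.2) for $0 \le n \le Q$ to obtain $\{d_k\}$.

```python
import numpy as np

# Hypothetical PZ(2, 1) model: A(z) = 1 - 0.5 z^-1 + 0.5 z^-2, D(z) = 1 + 0.5 z^-1.
a_true = np.array([1.0, -0.5, 0.5])
d_true = np.array([1.0, 0.5])
P, Q = len(a_true) - 1, len(d_true) - 1

def impulse_response(a, d, n_samples):
    """First n_samples of h(n) from the recursion (3.4.2)."""
    h = np.zeros(n_samples)
    for n in range(n_samples):
        dn = d[n] if n < len(d) else 0.0
        h[n] = dn - sum(a[k] * h[n - k] for k in range(1, len(a)) if n - k >= 0)
    return h

h = impulse_response(a_true, d_true, P + Q + 1)   # h(0)..h(P+Q)

def hval(n):
    """h(n) with the causal convention h(n) = 0 for n < 0."""
    return h[n] if n >= 0 else 0.0

# Step 1: P equations from (3.4.3), n = Q+1..Q+P, solved for a_1..a_P.
H = np.array([[hval(n - k) for k in range(1, P + 1)]
              for n in range(Q + 1, Q + P + 1)])
a_est = np.linalg.solve(H, -h[Q + 1:Q + P + 1])

# Step 2: d_n = h(n) + sum_k a_k h(n-k) for n = 0..Q, from (3.4.2).
d_est = np.array([h[n] + sum(a_est[k - 1] * hval(n - k) for k in range(1, P + 1))
                  for n in range(Q + 1)])

print(a_est, d_est)
```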

If the model is minimum-phase, the impulse response of the inverse model $h_I(n) = \mathcal{Z}^{-1}\{A(z)/D(z)\}$, $d_0 = 1$, 
can be computed in a similar manner. 


Autocorrelation. The complex spectrum of $H(z)$ is given by 

$$ R_h(z) = H(z) H^*\!\left(\frac{1}{z^*}\right) = \frac{D(z) D^*(1/z^*)}{A(z) A^*(1/z^*)} \triangleq \frac{R_d(z)}{R_a(z)} \tag{3.4.4} $$

where $R_d(z)$ and $R_a(z)$ are both finite two-sided polynomials. In a manner similar to the all-pole case, we can 
write a recursive relation between the autocorrelation, impulse response, and parameters of the model. Indeed, from 
(3.4.4) we obtain 

$$ A(z) R_h(z) = D(z) H^*\!\left(\frac{1}{z^*}\right) \tag{3.4.5} $$

Taking the inverse z-transform of (3.4.5) and noting that the inverse z-transform of $H^*(1/z^*)$ is $h^*(-n)$, we have 

$$ \sum_{k=0}^{P} a_k r_h(l-k) = \sum_{k=0}^{Q} d_k h^*(k-l) \qquad \text{for all } l \tag{3.4.6} $$

Since $h(n)$ is causal, we see that the right-hand side of (3.4.6) is zero for $l > Q$: 

$$ \sum_{k=0}^{P} a_k r_h(l-k) = 0 \qquad l > Q \tag{3.4.7} $$

Therefore, the autocorrelation of a pole-zero model obeys the linear prediction equation for $l > Q$. 

Because the impulse response h(n) is afunctionof a, and d,, the set of equations in (3.4.6) is nonlinear in 
terms of parameters a, and d,. However, (3.4.7) is linear in a, ; therefore, we can compute {a,} from (3.4.7), 
using the set of equations for l = Q +1,---,Q + P , which can be written in matrix form as 


r,(Q) n(Q-1) = 4(Q+P-N]fa]  [4(@-1 
7,(Q-1) iQ) = 4(Q+P-2)|/a,|__| 4(Q-2) (3.48) 


n(Q-P+1) n(Q—-P+2) -- 7,(Q) ap 1,(Q-P) 
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or Re@=-7T (3.4.9) 


Here, R, is a non-Hermitian Toeplitz matrix, and the linear system (3.4.8) can be solved by using the algorithm of 
Trench (Trench 1964; Carayannis et al. 1981). 

Even after we solve for $\mathbf{a}$, (3.4.6) continues to be nonlinear in $d_k$. To compute $d_k$, we use (3.4.4) to find
$R_d(z)$

$$R_d(z) = R_a(z)R_h(z) \tag{3.4.10}$$

where the coefficients of $R_a(z)$ are given by (with $a_0 = 1$)

$$r_a(l) = \sum_{k=0}^{P-|l|} a_k a^*_{k+|l|} \qquad -P \le l \le P \tag{3.4.11}$$

From (3.4.10), $r_d(l)$ is the convolution of $r_a(l)$ with $r_h(l)$, given by

$$r_d(l) = \sum_{k=-P}^{P} r_a(k)\, r_h(l-k) \tag{3.4.12}$$

If $r_h(l)$ was originally the autocorrelation of a PZ(P, Q) model, then $r_d(l)$ in (3.4.12) will be zero for $|l| > Q$.
Since $R_d(z)$ is specified, it can be factored into the product of two polynomials $D(z)$ and $D^*(1/z^*)$, where
$D(z)$ is minimum-phase.

Therefore, we have seen that, given the values of the autocorrelation $r_h(l)$ of a PZ(P, Q) model in the range
$0 \le l \le P+Q$, we can compute the values of the parameters $\{a_k\}$ and $\{d_k\}$ such that $H(z)$ is minimum-phase.
Now, given the parameters of a pole-zero model, we can compute its autocorrelation as follows. Equation (3.4.4) can 
be written as 


$$R_h(z) = R_a^{-1}(z)R_d(z) \tag{3.4.13}$$

where $R_a^{-1}(z)$ is the spectrum of the all-pole model $1/A(z)$, that is, $1/R_a(z)$. The coefficients of $R_a^{-1}(z)$
can be computed from $\{a_k\}$ by using (3.2.20) and (3.2.18). The coefficients of $R_d(z)$ are computed from (3.3.8). Then
$R_h(z)$ is the convolution of the two autocorrelations thus computed, which is equivalent to multiplying the two
polynomials in (3.4.13) and equating equal powers of $z$ on both sides of the equation. Since $R_d(z)$ is finite, the
summations used to obtain the coefficients of $R_h(z)$ are also finite.
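
As a numerical cross-check of the parameters-to-autocorrelation direction, the sketch below takes a different, simpler route than the convolution described above: it generates a long (truncated) impulse response and sums $r_h(l) = \sum_n h(n+l)h(n)$ directly. The model coefficients, the truncation length of 400 samples, and the function names are assumptions made for illustration; the approach is only reasonable for a stable, real-valued model whose impulse response decays quickly.

```python
import math


def impulse_response(a, d, n_samples):
    """h(n) of D(z)/A(z); a = [a_1..a_P], d = [d_0..d_Q] (gain included in d)."""
    h = []
    for n in range(n_samples):
        value = d[n] if n < len(d) else 0.0
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                value -= a_k * h[n - k]
        h.append(value)
    return h


def autocorrelation(h, max_lag):
    """Truncated r_h(l) = sum_n h(n + l) h(n) for a real, causal h."""
    return [
        sum(h[n + lag] * h[n] for n in range(len(h) - lag))
        for lag in range(max_lag + 1)
    ]


# Hypothetical PZ(2, 1) model: A(z) = 1 - 0.5 z^-1 + 0.5 z^-2,
# D(z) = g (1 + 0.5 z^-1) with g^2 = 8.
g = math.sqrt(8.0)
h = impulse_response([-0.5, 0.5], [g, g / 2.0], 400)
r = autocorrelation(h, 3)   # approximately [19, 9, -5, -7]
```

The poles of this illustrative model have magnitude of about 0.71, so 400 samples are far more than needed for the truncated sums to converge.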


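EXAMPLE 3.4.1. Consider a signal that has autocorrelation values of $r_h(0) = 19$, $r_h(1) = 9$, $r_h(2) = -5$, and
$r_h(3) = -7$. The parameters of the PZ(2, 1) model are found in the following manner. First form the equation from (3.4.8)

$$\begin{bmatrix} 9 & 19 \\ -5 & 9 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 7 \end{bmatrix}$$

which yields $a_1 = -1/2$, $a_2 = 1/2$. Then we compute the coefficients from (3.4.11), $r_a(0) = 3/2$, $r_a(\pm 1) = -3/4$, and
$r_a(\pm 2) = 1/2$. Computing the convolution in (3.4.12) for $|l| \le Q = 1$, we obtain the following polynomial:

$$R_d(z) = 4z + 10 + 4z^{-1} = 4\left(1 + \frac{1}{2z}\right)(z + 2) = 8\left(1 + \tfrac{1}{2}z^{-1}\right)\left(1 + \tfrac{1}{2}z\right)$$

Therefore, $D(z)$ is obtained by taking the causal part, that is, $D(z) = 2\sqrt{2}\,[1 + \tfrac{1}{2}z^{-1}]$, and $d_1 = 1/2$.
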
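
The arithmetic of this example can be checked with exact rational numbers. The sketch below is only an illustration under stated assumptions: the variable and function names are hypothetical, Cramer's rule stands in for a general Toeplitz solver because the system is 2 x 2, and the last step solves the quadratic implied by $r_d(0) = g^2(1 + d_1^2)$ and $r_d(1) = g^2 d_1$ for a first-order numerator $D(z) = g(1 + d_1 z^{-1})$.

```python
from fractions import Fraction
import math

r_given = [Fraction(19), Fraction(9), Fraction(-5), Fraction(-7)]


def r_h(lag):
    """Real autocorrelation: r_h(-l) = r_h(l)."""
    return r_given[abs(lag)]


P, Q = 2, 1

# Rows l = Q+1 and l = Q+2 of (3.4.7), solved with Cramer's rule.
m11, m12, b1 = r_h(Q), r_h(Q - 1), -r_h(Q + 1)
m21, m22, b2 = r_h(Q + 1), r_h(Q), -r_h(Q + 2)
det = m11 * m22 - m12 * m21
a1 = (b1 * m22 - m12 * b2) / det
a2 = (m11 * b2 - m21 * b1) / det
a = [Fraction(1), a1, a2]


def r_a(lag):
    """Coefficients of R_a(z) from (3.4.11), real coefficients assumed."""
    lag = abs(lag)
    return sum(a[k] * a[k + lag] for k in range(P - lag + 1))


def r_d(lag):
    """Convolution (3.4.12)."""
    return sum(r_a(k) * r_h(lag - k) for k in range(-P, P + 1))


rd0, rd1 = r_d(0), r_d(1)

# Minimum-phase root of rd1*d1^2 - rd0*d1 + rd1 = 0, then the squared gain.
d1 = (float(rd0) - math.sqrt(float(rd0) ** 2 - 4.0 * float(rd1) ** 2)) / (2.0 * float(rd1))
gain_squared = float(rd1) / d1
```

Using `Fraction` keeps the denominator coefficients and $r_d(l)$ exact, so any mismatch with the hand calculation would show up as an exact inequality rather than a rounding difference.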
Spectrum. The spectrum of H(z) is given by 


$$R_h(e^{j\omega}) = |H(e^{j\omega})|^2 = \frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \tag{3.4.14}$$

Therefore, $R_h(e^{j\omega})$ can be obtained by dividing the spectrum of $D(z)$ by the spectrum of $A(z)$. Again, the FFT
can be used to advantage in computing the numerator and denominator of (3.4.14). If the spectrum $R_h(e^{j\omega})$ of a
PZ(P, Q) model is given, then the parameters of the (minimum-phase) model can be recovered by first computing
the autocorrelation $r_h(l)$ as the inverse Fourier transform of $R_h(e^{j\omega})$ and then using the procedure outlined in the
previous section to compute the sets of coefficients $\{a_k\}$ and $\{d_k\}$.
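
A direct evaluation of (3.4.14) at a single frequency can be sketched as follows. The function name and the coefficient values are assumptions for illustration; in practice the numerator and denominator would normally be evaluated on a whole frequency grid with an FFT rather than one frequency at a time.

```python
import cmath
import math


def pz_spectrum(a, d, omega):
    """|D(e^{jw})|^2 / |A(e^{jw})|^2 with a = [1, a_1, ...], d = [d_0, d_1, ...]."""
    z_inv = cmath.exp(-1j * omega)
    numerator = sum(d_k * z_inv ** k for k, d_k in enumerate(d))
    denominator = sum(a_k * z_inv ** k for k, a_k in enumerate(a))
    return abs(numerator) ** 2 / abs(denominator) ** 2


# Hypothetical PZ(2, 1) model: A(z) = 1 - 0.5 z^-1 + 0.5 z^-2, D(z) = 1 + 0.5 z^-1.
a = [1.0, -0.5, 0.5]
d = [1.0, 0.5]
spectrum_at_dc = pz_spectrum(a, d, 0.0)
spectrum_at_pi = pz_spectrum(a, d, math.pi)
```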


Partial autocorrelation and lattice-ladder structures. Since a PZ(P, Q) model is equivalent to an AP(∞)
model, its PACS has infinite extent and behaves, after a certain lag, as the PACS of an all-zero model.


3.4.2 Autoregressive Moving-Average Models


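The autoregressive moving-average model is a PZ(P, Q) model driven by white noise and is denoted by
ARMA(P, Q). Again, we set $d_0 = 1$ and incorporate the gain into the variance (power) of the white noise excitation.
Hence, a causal ARMA(P, Q) model is defined by

$$x(n) = -\sum_{k=1}^{P} a_k x(n-k) + w(n) + \sum_{k=1}^{Q} d_k w(n-k) \tag{3.4.15}$$

where $\{w(n)\} \sim \mathrm{WN}(0, \sigma_w^2)$. The ARMA(P, Q) model parameters are $\{\sigma_w^2, a_1, \ldots, a_P, d_1, \ldots, d_Q\}$.
The output has zero mean and variance of

$$\sigma_x^2 = -\sum_{k=1}^{P} a_k r_x(k) + \sigma_w^2\left[1 + \sum_{k=1}^{Q} d_k h(k)\right] \tag{3.4.16}$$

where $h(n)$ is the impulse response of the model. The presence of $h(n)$ in (3.4.16) makes the dependence of $\sigma_x^2$
on the model parameters highly nonlinear. The autocorrelation of $x(n)$ is given by

$$\sum_{k=0}^{P} a_k r_x(l-k) = \sigma_w^2\left[h(-l) + \sum_{k=1}^{Q} d_k h(k-l)\right] \qquad \text{for all } l \tag{3.4.17}$$

where, for a real-valued causal model, $h(-l)$ equals 1 at $l = 0$ and vanishes for $l > 0$, so that (3.4.17) reduces to
(3.4.16) at $l = 0$. The power spectrum is given by

$$R_x(e^{j\omega}) = \sigma_w^2\,\frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2} \tag{3.4.18}$$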


The significance of ARMA(P, Q) models is that they can provide more accurate representations than AR or MA 
models with the same number of parameters. The ARMA model is able to combine the spectral peak matching of the 
AR model with the ability of the MA model to place nulls in the spectrum. 
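
The difference equation (3.4.15) translates directly into a simulation loop. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the excitation is taken to be Gaussian (the model itself only requires white noise), and zero initial conditions are used, so the first samples contain a start-up transient.

```python
import random


def simulate_arma(a, d, sigma_w, n_samples, seed=0):
    """Generate x(n) from (3.4.15); a = [a_1..a_P], d = [d_1..d_Q], d_0 = 1."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, sigma_w) for _ in range(n_samples)]
    x = []
    for n in range(n_samples):
        value = w[n]
        for k, d_k in enumerate(d, start=1):      # moving-average part
            if n - k >= 0:
                value += d_k * w[n - k]
        for k, a_k in enumerate(a, start=1):      # autoregressive part
            if n - k >= 0:
                value -= a_k * x[n - k]
        x.append(value)
    return x, w


# Hypothetical ARMA(1, 1) model: x(n) = 0.5 x(n-1) + w(n) + 0.4 w(n-1).
x, w = simulate_arma(a=[-0.5], d=[0.4], sigma_w=1.0, n_samples=200)
```

Returning the excitation `w` alongside `x` makes it easy to verify that the generated samples satisfy the difference equation.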


3.4.3 The First-Order Pole-Zero Model: PZ(1, 1) 


Consider the PZ(1, 1) model with the following system function

$$H(z) = G\,\frac{1 + d_1 z^{-1}}{1 + a_1 z^{-1}} \tag{3.4.19}$$

where $d_1$ and $a_1$ are real coefficients. The model is minimum-phase if

$$-1 < d_1 < 1 \qquad -1 < a_1 < 1 \tag{3.4.20}$$

which correspond to the rectangular region shown in Figure 3.12(a).

FIGURE 3.12
Minimum-phase and positive definiteness regions for the PZ(1,1) model in the (a) $(d_1, a_1)$ space and (b) $(\rho(1), \rho(2))$ space.


For the minimum-phase case, the impulse responses of the direct and the inverse models are

$$h(n) = \mathcal{Z}^{-1}\{H(z)\} = \begin{cases} 0 & n < 0 \\ G & n = 0 \\ G(-a_1)^{n-1}(d_1 - a_1) & n > 0 \end{cases} \tag{3.4.21}$$

and

$$h_I(n) = \mathcal{Z}^{-1}\left\{\frac{1}{H(z)}\right\} = \begin{cases} 0 & n < 0 \\ 1/G & n = 0 \\ (1/G)(-d_1)^{n-1}(a_1 - d_1) & n > 0 \end{cases} \tag{3.4.22}$$

respectively. We note that as the pole $p = -a_1$ gets closer to the unit circle, the impulse response decays more
slowly and the model has “longer memory.” The zero $z = -d_1$ controls the impulse response of the inverse model in
a similar way. The PZ(1, 1) model is equivalent to the AZ(∞) model

$$x(n) = G w(n) + \sum_{k=1}^{\infty} h(k) w(n-k) \tag{3.4.23}$$

or the AP(∞) model

$$x(n) = -G\sum_{k=1}^{\infty} h_I(k) x(n-k) + G w(n) \tag{3.4.24}$$


If we wish to approximate the PZ(1, 1) model with a finite-order AZ(Q) model, the order Q required to achieve a 
certain accuracy increases as the pole moves closer to the unit circle. Likewise, in the case of an AP(P) approximation, 
better fits to the PZ(P, Q) model require an increased order P as the zero moves closer to the unit circle. 

To determine the autocorrelation, we recall from (3.4.6) that for a causal model 


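$$r_h(l) = -a_1 r_h(l-1) + G h(-l) + G d_1 h(1-l) \qquad \text{all } l \tag{3.4.25}$$

or

$$\begin{aligned} r_h(0) &= -a_1 r_h(1) + G^2 + G^2 d_1(d_1 - a_1) \\ r_h(1) &= -a_1 r_h(0) + G^2 d_1 \\ r_h(l) &= -a_1 r_h(l-1) \qquad l \ge 2 \end{aligned} \tag{3.4.26}$$

Solving the first two equations for $r_h(0)$ and $r_h(1)$, we obtain

$$r_h(0) = G^2\,\frac{1 + d_1^2 - 2a_1 d_1}{1 - a_1^2} \tag{3.4.27}$$

and

$$r_h(1) = G^2\,\frac{(d_1 - a_1)(1 - a_1 d_1)}{1 - a_1^2} \tag{3.4.28}$$

The normalized autocorrelation is given by

$$\rho_h(1) = \frac{(d_1 - a_1)(1 - a_1 d_1)}{1 + d_1^2 - 2a_1 d_1} \tag{3.4.29}$$

and

$$\rho_h(l) = (-a_1)\,\rho_h(l-1) \qquad l \ge 2 \tag{3.4.30}$$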


Note that given $\rho_h(1)$ and $\rho_h(2)$, we have a nonlinear system of equations that must be solved to obtain $a_1$ and
$d_1$. By using Equations (3.4.20), (3.4.29), and (3.4.30), it can be shown (see Problem 3.28) that the PZ(1, 1) is
minimum-phase if the ACS satisfies the conditions

$$\begin{aligned} |\rho(2)| &< |\rho(1)| \\ \rho(2) &> \rho(1)[2\rho(1) + 1] \qquad \rho(1) < 0 \\ \rho(2) &> \rho(1)[2\rho(1) - 1] \qquad \rho(1) > 0 \end{aligned} \tag{3.4.31}$$

which correspond to the admissible region shown in Figure 3.12(b).
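
The normalized autocorrelation (3.4.29)-(3.4.30) and the admissibility test (3.4.31) are easy to evaluate numerically. The sketch below is illustrative only: the function names and the coefficient pair are assumptions, and the boundary case $\rho(1) = 0$ (a white process, $d_1 = a_1$) is treated as outside the region because the strict inequality $|\rho(2)| < |\rho(1)|$ cannot hold there.

```python
def pz11_normalized_acs(a1, d1):
    """rho(1) from (3.4.29) and rho(2) from (3.4.30) for real a1, d1."""
    rho1 = (d1 - a1) * (1.0 - a1 * d1) / (1.0 + d1 * d1 - 2.0 * a1 * d1)
    rho2 = -a1 * rho1
    return rho1, rho2


def satisfies_3_4_31(rho1, rho2):
    """Check the three inequalities of (3.4.31)."""
    if abs(rho2) >= abs(rho1):
        return False
    if rho1 < 0:
        return rho2 > rho1 * (2.0 * rho1 + 1.0)
    return rho2 > rho1 * (2.0 * rho1 - 1.0)


# Hypothetical minimum-phase PZ(1, 1) model.
rho1, rho2 = pz11_normalized_acs(a1=-0.5, d1=0.4)
inside = satisfies_3_4_31(rho1, rho2)
outside = satisfies_3_4_31(0.9, 0.1)   # a pair that violates the third inequality
```

Note that a zero at $-d_1$ and its reciprocal $-1/d_1$ give the same normalized autocorrelation, so the test constrains the autocorrelation values rather than identifying which of the two zeros generated them.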


3.4.4 Summary and Dualities 


Table 3.1 summarizes the key properties of all-zero, all-pole, and pole-zero models. These properties help to identify 
models for empirical discrete-time signals. Furthermore, the table shows the duality between AZ and AP models.
More specifically, we see that

1. An invertible AZ(Q) model is equivalent to an AP(∞) model. Thus, it has a finite-extent autocorrelation and an
infinite-extent partial autocorrelation.

2. A stable AP(P) model is equivalent to an AZ(∞) model. Thus, it has an infinite-extent autocorrelation and a
finite-extent partial autocorrelation.

3. The autocorrelation of an AZ( Q ) model behaves as the partial autocorrelation of an AP( P ) model, and vice versa. 

4. The spectra of an AP( P ) model and an AZ(Q ) model are related through an inverse relationship. 


Table 3.1.
Summary of all-pole, all-zero, and pole-zero model properties

| Model | AP(P) | AZ(Q) | PZ(P, Q) |
|---|---|---|---|
| Input-output description | $x(n) + \sum_{k=1}^{P} a_k x(n-k) = w(n)$ | $x(n) = d_0 w(n) + \sum_{k=1}^{Q} d_k w(n-k)$ | $x(n) + \sum_{k=1}^{P} a_k x(n-k) = d_0 w(n) + \sum_{k=1}^{Q} d_k w(n-k)$ |
| System function | $H(z) = \dfrac{1}{A(z)} = \dfrac{1}{1 + \sum_{k=1}^{P} a_k z^{-k}}$ | $H(z) = D(z) = d_0 + \sum_{k=1}^{Q} d_k z^{-k}$ | $H(z) = \dfrac{D(z)}{A(z)}$ |
| Recursive representation | Finite summation | Infinite summation | Infinite summation |
| Nonrecursive representation | Infinite summation | Finite summation | Infinite summation |
| Stability conditions | Poles inside unit circle | Always | Poles inside unit circle |
| Invertibility conditions | Always | Zeros inside unit circle | Zeros inside unit circle |
| Autocorrelation sequence | Infinite duration (damped exponentials and/or sine waves); tails off | Finite duration; cuts off | Infinite duration (damped exponentials and/or sine waves after Q−P lags); tails off |
| Partial autocorrelation | Finite duration; cuts off | Infinite duration (damped exponentials and/or sine waves); tails off | Infinite duration (dominated by damped exponentials and/or sine waves after Q−P lags); tails off |
| Spectrum | Good peak matching | Good “notch” matching | Good peak and valley matching |





These dualities and properties have been shown and illustrated for low-order models in the previous sections. 


3.5 Summary 


In this chapter we introduced the class of pole-zero signal models and discussed their properties. Each model consists 
of two components: an excitation source and a system. In our treatment, we emphasized that the properties of a signal 
model are shaped by the properties of both components; and we tried, whenever possible, to attribute each property to 
its originator. Thus, for uncorrelated random inputs, which by definition are the excitations for ARMA models, the 
second-order moments of the signal model and its minimum-phase characteristics are completely determined by the 
system. 

We provided a detailed description of the autocorrelation and power spectral density of AZ, AP, and PZ models
for the general case and for first- and second-order models. An understanding of these properties is very important for
model selection in practical applications.


Problems 


3.1 Show that a second-order pole $p_i$ contributes the term $n p_i^n u(n)$ and a third-order pole the terms
$n p_i^n u(n) + n^2 p_i^n u(n)$ to the impulse response of a causal PZ model. The general case is discussed in
Oppenheim et al. (1997).
3.2 Consider a zero-mean random sequence $x(n)$ with PSD

$$R_x(e^{j\omega}) = \frac{5 + 3\cos\omega}{17 + 8\cos\omega}$$

(a) Determine the innovations representation of the process $x(n)$.
(b) Find the autocorrelation sequence $r_x(l)$.
3.3 We want to generate samples of a Gaussian process with autocorrelation $r_x(l) = (\tfrac{1}{2})^{|l|} + (-\tfrac{1}{2})^{|l|}$ for all $l$.
(a) Find the difference equation that generates the process $x(n)$ when excited by $w(n) \sim \mathrm{WGN}(0,1)$.
(b) Generate $N = 1000$ samples of the process and estimate the pdf, using the histogram, and the normalized autocorrelation
$\rho_x(l)$ using $\hat{\rho}_x(l)$ [see Section (1.2.1)].
(c) Check the validity of the model by plotting on the same graph (i) the true and estimated pdf of $x(n)$ and (ii) the true and
estimated autocorrelation.
3.4 Compute and compare the autocorrelations of the following processes:
(a) $x_1(n) = w(n) + 0.3w(n-1) - 0.4w(n-2)$ and
(b) $x_2(n) = w(n) - 1.2w(n-1) - 1.6w(n-2)$, where $w(n) \sim \mathrm{WGN}(0,1)$.
Explain your findings.
3.5 Compute and plot the impulse response and the magnitude response of the systems $H(z)$ and $H_N(z)$ in Example 3.2.1 for
$a = 0.7, 0.95$ and $N = 8, 16, 64$. Investigate how well the all-zero systems approximate the single-pole system.
3.6 Prove Equation (3.2.25) by writing explicitly Equation (3.2.23) and rearranging terms. Then show that the coefficient matrix A
can be written as the sum of a triangular Toeplitz matrix and a triangular Hankel matrix (recall that a matrix H is Hankel if the
matrix $JHJ^H$ is Toeplitz).
3.7 Use the Yule-Walker equations to determine the autocorrelation and partial autocorrelation coefficients of the following AR models,
assuming that $w(n) \sim \mathrm{WN}(0,1)$.
(a) $x(n) = 0.5x(n-1) + w(n)$.
(b) $x(n) = 1.5x(n-1) - 0.6x(n-2) + w(n)$.
What is the variance $\sigma_x^2$ of the resulting process?
3.8 Given the AR process $x(n) = x(n-1) - 0.5x(n-2) + w(n)$, complete the following tasks.
(a) Determine $\rho_x(1)$.
(b) Using $\rho_x(0)$ and $\rho_x(1)$, compute $\{\rho_x(l)\}$ for $l \ge 2$ by the corresponding difference equation.
(c) Plot $\rho_x(l)$ and use the resulting graph to estimate its period.
(d) Compare the period obtained in part (c) with the value obtained using the PSD of the model. (Hint: Use the frequency of the
PSD peak.)
3.9 Given the parameters $d_0$, $a_1$, $a_2$, and $a_3$ of an AP(3) model, compute its ACS analytically and verify your results, using the
values in Example 3.2.3. (Hint: Use Cramer’s rule.)
3.10 Consider the following AP(3) model: $x(n) = 0.98x(n-3) + w(n)$, where $w(n) \sim \mathrm{WGN}(0,1)$.
(a) Plot the PSD of $x(n)$ and check if the obtained process is going to exhibit a pseudoperiodic behavior.
(b) Generate and plot 100 samples of the process. Does the graph support the conclusion of part (a)? If yes, what is the period?
(c) Compute and plot the PSD of the process $y(n) = \tfrac{1}{3}[x(n-1) + x(n) + x(n+1)]$.
(d) Repeat part (b) and explain the difference between the behavior of processes $x(n)$ and $y(n)$.
3.11 Consider the following AR(2) models: (i) $x(n) = 0.6x(n-1) + 0.3x(n-2) + w(n)$ and (ii) $x(n) = 0.8x(n-1) - 0.5x(n-2) + w(n)$,
where $w(n) \sim \mathrm{WGN}(0,1)$.
(a) Find the general expression for the normalized autocorrelation sequence $\rho(l)$, and determine $\sigma_x^2$.
(b) Plot $\{\rho(l)\}$ for $l \ge 0$ and check if the models exhibit pseudoperiodic behavior.
(c) Justify your answer in part (b) by plotting the PSD of the two models.
3.12 (a) Derive the formulas that express the PACS of an AP(3) model in terms of its ACS, using the Yule-Walker equations and
Cramer’s rule.
(b) Use the obtained formulas to compute the PACS of the AP(3) model in Example 3.2.3.
(c) Check the results in part (b) by recomputing the PACS, using the algorithm of Levinson-Durbin.
3.13 Show that the spectrum of any PZ model with real coefficients has zero slope at $\omega = 0$ and $\omega = \pi$.
3.14 Derive Equations (3.2.71) describing the minimum-phase region of the AP(2) model, starting from the conditions
(a) $|p_1| < 1$, $|p_2| < 1$ and
(b) $|k_1| < 1$, $|k_2| < 1$.
3.15 (a) Show that the spectrum of an AP(2) model with real poles can be obtained by the cascade connection of two AP(1) models with
real coefficients.
(b) Compute and plot the impulse response, ACS, PACS, and spectrum of the AP models with $p_1 = 0.6$, $p_2 = -0.9$, and
$p_1 = p_2 = 0.9$.
3.16 Prove Equation (3.2.89) and demonstrate its validity by plotting the spectrum (3.2.88) for various values of $r$ and $\theta$.
3.17 Prove that if the AP(P) model $A(z)$ is minimum-phase, then

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \log\frac{1}{|A(e^{j\omega})|^2}\,d\omega = 0$$

3.18 (a) Prove Equations (3.2.101) and (3.2.102) and recreate the plot in Figure 3.8(a).
(b) Determine and plot the regions corresponding to complex and real poles in the autocorrelation domain by recreating Figure
3.8(b).
3.19 Consider an AR(2) process $x(n)$ with $d_0 = 1$, $a_1 = -1.6454$, $a_2 = 0.9025$, and $w(n) \sim \mathrm{WGN}(0,1)$.
(a) Generate 100 samples of the process and use them to estimate the ACS $\hat{\rho}_x(l)$, using Equation (1.2.1).
(b) Plot and compare the estimated and theoretical ACS values for $0 \le l \le 10$.
(c) Use the estimated values of $\hat{\rho}_x(l)$ and the Yule-Walker equations to estimate the parameters of the model. Compare the
estimated with the true values, and comment on the accuracy of the approach.
(d) Use the estimated parameters to compute the PSD of the process. Plot and compare the estimated and true PSDs of the process.
(e) Compute and compare the estimated with the true PACS.
3.20 Find a minimum-phase model with autocorrelation $\rho(0) = 1$, $\rho(\pm 1) = 0.25$, and $\rho(l) = 0$ for $|l| \ge 2$.
3.21 Consider the MA(2) model $x(n) = w(n) - 0.1w(n-1) + 0.2w(n-2)$.
(a) Is the process $x(n)$ stationary? Why?
(b) Is the model minimum-phase? Why?
(c) Determine the autocorrelation and partial autocorrelation of the process.
3.22 Consider the following ARMA models: (i) $x(n) = 0.6x(n-1) + w(n) - 0.9w(n-1)$ and (ii) $x(n) = 1.4x(n-1) - 0.6x(n-2) +
w(n) - 0.8w(n-1)$.
(a) Find a general expression for the autocorrelation $\rho(l)$.
(b) Compute the partial autocorrelation $k_m$ for $m = 1, 2, 3$.
(c) Generate 100 samples from each process, and use them to estimate $\{\hat{\rho}(l)\}_0^{20}$ using Equation (1.2.1).
(d) Use $\hat{\rho}(l)$ to estimate $\{\hat{k}_m\}_1^{20}$.
(e) Plot and compare the estimates with the theoretically obtained values.
3.23 Determine the coefficients of a PZ(2,1) model with autocorrelation values $r_h(0) = 19$, $r_h(1) = 9$, $r_h(2) = -5$, and $r_h(3) = -7$.
3.24 (a) Show that the impulse response of an AZ(Q) model can be recovered from its response $\tilde{h}(n)$ to a periodic train with period $L$ if
$L > Q$.
(b) Show that the ACS of an AZ(Q) model can be recovered from the ACS or spectrum of $\tilde{h}(n)$ if $L \ge 2Q + 1$.
3.25 Prove Equation (3.3.17) and illustrate its validity by computing the PACS of the model $H(z) = 1 - 0.8z^{-1}$.
3.26 Prove Equations (3.3.24) that describe the minimum-phase region of the AZ(2) model.
3.27 Consider an AZ(2) model with $d_0 = 2$ and zeros $z_{1,2} = 0.95e^{\pm j\pi/3}$.
(a) Compute and plot $N = 100$ output samples by exciting the model with the process $w(n) \sim \mathrm{WGN}(0,1)$.
(b) Compute and plot the ACS, PACS, and spectrum of the model.
(c) Repeat parts (a) and (b) by assuming that we have an AP(2) model with poles at $p_{1,2} = 0.95e^{\pm j\pi/3}$.
(d) Investigate the duality between the ACS and PACS of the two models.
3.28 Prove Equations (3.4.31) and use them to reproduce the plot shown in Figure 3.12(b). Indicate which equation corresponds to each
curve.
3.29 Determine the spectral flatness measure of the following processes:
(a) $x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n)$ and
(b) $x(n) = w(n) + b_1 w(n-1) + b_2 w(n-2)$, where $w(n)$ is a white noise sequence.
3.30 Consider a zero-mean wide-sense stationary (WSS) process $x(n)$ with PSD $R_x(e^{j\omega})$ and an $M \times M$ correlation matrix with
eigenvalues $\{\lambda_k\}_1^M$. Szegő’s theorem (Grenander and Szegő 1958) states that if $g(\cdot)$ is a continuous function, then

$$\lim_{M \to \infty} \frac{g(\lambda_1) + g(\lambda_2) + \cdots + g(\lambda_M)}{M} = \frac{1}{2\pi}\int_{-\pi}^{\pi} g[R_x(e^{j\omega})]\,d\omega$$

Using this theorem, show that

$$\lim_{M \to \infty} (\det \mathbf{R}_x)^{1/M} = \exp\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln[R_x(e^{j\omega})]\,d\omega\right\}$$

3.31 Consider two linear random processes with system functions

$$\text{(i)}\ H(z) = \frac{1 - 0.81z^{-1} - 0.4z^{-2}}{(1 - z^{-1})^2} \qquad \text{and} \qquad \text{(ii)}\ H(z) = \frac{1 - 0.5z^{-1}}{1 - z^{-1}}$$

(a) Find a difference equation that leads to a numerically stable simulation of each process.
(b) Generate and plot 100 samples from each process, and look for indications of nonstationarity in the obtained records.
(c) Compute and plot the second difference of (i) and the first difference of (ii). Comment about the stationarity of the obtained
records.
3.32 Generate and plot 100 samples for each of the linear processes with system functions

$$\text{(a)}\ H(z) = \frac{1}{(1 - z^{-1})(1 - 0.9z^{-1})} \qquad \text{(b)}\ H(z) = \frac{1 - 0.5z^{-1}}{(1 - z^{-1})(1 - 0.9z^{-1})}$$

and then estimate and examine the values of the ACS $\{\hat{\rho}(l)\}_0^{20}$ and the PACS $\{\hat{k}_m\}_1^{20}$.
3.33 Consider the process $y(n) = d_0 + d_1 n + d_2 n^2 + x(n)$, where $x(n)$ is a stationary process with known autocorrelation $r_x(l)$.
(a) Show that the process $y^{(2)}(n)$ obtained by passing $y(n)$ through the filter $H(z) = (1 - z^{-1})^2$ is stationary.
(b) Express the autocorrelation $r_y^{(2)}(l)$ of $y^{(2)}(n)$ in terms of $r_x(l)$. Note: This process is used in practice to remove quadratic
trends from data before further analysis.


CHAPTER 4


Nonparametric Power Spectrum Estimation 


The essence of frequency analysis is the representation of a signal as a superposition of sinusoidal components. In 
theory, the exact form of this decomposition (spectrum) depends on the assumed signal model. In Chapter 2 we
discussed the mathematical tools required to define and compute the spectrum of signals described by stochastic 
models. In practical applications, where only a finite segment of a signal is available, we cannot obtain a complete 
description of the adopted signal model. Therefore, we can only compute an approximation (estimate) of the 
spectrum of the adopted signal model (“true” or theoretical spectrum). The quality of the estimated spectrum depends 
on 

• How well the assumed signal model represents the data.

• What values we assign to the unavailable signal samples.

• Which spectrum estimation method we use.

Clearly, meaningful application of spectrum estimation in practical problems requires sufficient a priori 
information, understanding of the signal generation process, knowledge of theoretical concepts, and experience. 

In this chapter we discuss the most widely used correlation and spectrum estimation methods, as well as their 
properties, implementation, and application to practical problems. We discuss only nonparametric techniques that do 
not assume a particular functional form, but allow the form of the estimator to be determined entirely by the data. 
These methods are based on the discrete Fourier transform of either the signal segment or its autocorrelation sequence. 
In contrast, parametric methods assume that the available signal segment has been generated by a specific parametric 
model (e.g., a pole-zero or harmonic model). Since the choice of an inappropriate signal model will lead to erroneous 
results, the successful application of parametric techniques, without sufficient a priori information, is very difficult in 
practice. These methods are discussed in Chapter 8. 

We begin this chapter with an introductory discussion on the purpose of, and the DSP approach to, spectrum 
estimation. We explore various errors involved in the estimation of finite-length data records (i.e., based on partial 
information). Section 4.3 is the main section of this chapter in which we discuss various nonparametric approaches to 
the power spectrum estimation of stationary random signals. The computation of auto and cross-spectra using 
Thomson’s multiple windows (or multitapers) is discussed in Section 4.5. Finally, in Section 4.6 we summarize 
important topics and concepts from this chapter. A classification of the various spectral estimation methods that are 
discussed in this book is provided in Figure 4.1. 


4.1 Spectral Analysis of Deterministic Signals 


If we adopt a deterministic signal model, the mathematical tools for spectral analysis are the Fourier series and the 
Fourier transforms. It should be stressed at this point that applying any of these tools requires that the signal values in 
the entire time interval from $-\infty$ to $+\infty$ be available. If it is known a priori that a signal is periodic, then only one
period is needed. The rationale for defining and studying various spectra for deterministic signals is threefold. First, 
we note that every realization (or sample function) of a stochastic process is a deterministic function. Thus we can use 
the Fourier series and transforms to compute a spectrum for stationary processes. Second, deterministic functions and 
sequences are used in many aspects of the study of stationary processes, for example, the autocorrelation sequence, 
which is a deterministic sequence. Third, the various spectra that can be defined for deterministic signals can be used 
to summarize important features of stationary processes.

FIGURE 4.1
Classification of various spectrum estimation methods.


Most practical applications of spectrum estimation involve continuous-time signals. For example, in speech 
analysis we use spectrum estimation to determine the pitch of the glottal excitation and the formants of the vocal tract 
(Rabiner and Schafer 1978). In electroencephalography, we use spectrum estimation to study sleep disorders and the 
effect of medication on the functioning of the brain (Duffy, Iyer, and Surwillo 1989). Another application is in 
Doppler radar, where the frequency shift between the transmitted and the received waveform is used to determine the 
radial velocity of the target (Levanon 1988). 

The numerical computation of the spectrum of a continuous-time signal involves three steps: 

1. Sampling the continuous-time signal to obtain a sequence of samples. 

2. Collecting a finite number of contiguous samples (data segment or block) to use for the computation of the 
spectrum. This operation, which usually includes weighting of the signal samples, is known as windowing, or 
tapering. 

3. Computing the values of the spectrum at the desired set of frequencies. This step is usually implemented using 
some efficient implementation of the DFT. 

The above processing steps, which are necessary for DFT-based spectrum estimation, are shown in Figure 4.2. 
The continuous-time signal is first processed through a low-pass (antialiasing) filter and then sampled to obtain a 
discrete-time signal. Data samples of frame length $N$ with frame overlap $N_0$ are selected and then conditioned using
a window. Finally, a suitable-length DFT of the windowed data is taken as an estimate of its spectrum, which is then 
analyzed. In this section, we discuss in detail the effects of each of these operations on the accuracy of the computed 
spectrum. The understanding of the implications of these effects is very important in all practical applications of
spectrum estimation.


FIGURE 4.2
DFT-based Fourier analysis system for continuous-time signals.


4.1.1 Effect of Signal Sampling 


The continuous-time signal $s_c(t)$, whose spectrum we seek to estimate, is first passed through a low-pass filter, also
known as an antialiasing filter $H_{lp}(F)$, in order to minimize the aliasing error after sampling. The antialiased signal
$x_c(t)$ is then sampled through an analog-to-digital converter¹ (ADC) to produce the discrete-time sequence $x(n)$,
that is,

$$x(n) = x_c(t)\big|_{t = n/F_s} \tag{4.1.1}$$

From the sampling theorem, we have

$$X(e^{j2\pi F/F_s}) = F_s \sum_{l=-\infty}^{\infty} X_c(F - lF_s) \tag{4.1.2}$$

where $X_c(F) = H_{lp}(F)S_c(F)$ and $S_c(F)$ is the spectrum of the unfiltered signal. We note that the spectrum of the
discrete-time signal $x(n)$ is a periodic replication of $X_c(F)$. Overlapping of the replicas $X_c(F - lF_s)$ results in
aliasing. Since any practical antialiasing filter does not have infinite attenuation in the stopband, some nonzero overlap
of frequencies higher than $F_s/2$ should be expected within the band of frequencies of interest in $x(n)$. These aliased
frequencies give rise to the aliasing error, which, in any practical signal, is unavoidable. It can be made negligible by a
properly designed antialiasing filter $H_{lp}(F)$.
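
The periodic replication in (4.1.2) means that, after sampling, a component at frequency $F$ cannot be told apart from one at $F + lF_s$. The short sketch below illustrates this for cosines; the function name and the chosen frequencies are assumptions made for the example.

```python
import math


def sample_cosine(freq_hz, fs_hz, n_samples):
    """Samples x(n) = cos(2 pi F n / Fs) of a unit-amplitude cosine."""
    return [math.cos(2.0 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]


fs = 10.0
in_band = sample_cosine(1.0, fs, 20)    # F = 1 Hz, below Fs/2
aliased = sample_cosine(11.0, fs, 20)   # F = 1 Hz + Fs
max_difference = max(abs(u - v) for u, v in zip(in_band, aliased))
```

The two sample sequences agree to within rounding error, which is why any energy left above $F_s/2$ by the antialiasing filter ends up inside the band of interest.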


4.1.2 Windowing, Periodic Extension, and Extrapolation 


In practice, we compute the spectrum of a signal by using a finite-duration segment. The reason is threefold:

1. The spectral composition of the signal changes with time, or

2. We have only a finite set of data at our disposal, or

3. We wish to keep the computational complexity to an acceptable level.

Therefore, it is necessary to partition $x(n)$ into blocks (or frames) of data prior to processing. This operation is
called frame blocking, and it is characterized by two parameters: the length of frame $N$ and the overlap between
frames $N_0$ (see Figure 4.2). Therefore, the central problem in practical frequency analysis can be stated as follows:

Determine the spectrum of a signal $x(n)$, $-\infty < n < \infty$, from its values in a finite interval $0 \le n \le N-1$, that is,
from a finite-duration segment.

Since $x(n)$ is unknown for $n < 0$ and $n \ge N$, we cannot say, without having sufficient a priori information,
whether the signal is periodic or aperiodic. If we can reasonably assume that the signal is periodic with fundamental
period $N$, we can easily determine its spectrum by computing its Fourier series, using the DFT.

However, in most practical applications, we cannot make this assumption because the available block of data 
could be either part of the period of a periodic signal or a segment from an aperiodic signal. In such cases, the 
spectrum of the signal cannot be determined without assigning values to the signal samples outside the available 
interval. There are three ways to deal with this issue: 

1. Periodic extension. We assume that $x(n)$ is periodic with period $N$, that is, $x(n) = x(n+N)$ for all $n$, and
we compute its Fourier series, using the DFT.

2. Windowing. We assume that the signal is zero outside the interval of observation, that is, $x(n) = 0$ for $n < 0$
and $n \ge N$. This is equivalent to multiplying the signal with the rectangular window

$$w_R(n) \triangleq \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{elsewhere} \end{cases} \tag{4.1.3}$$

The resulting sequence is aperiodic, and its spectrum is obtained by the discrete-time Fourier transform (DTFT).

3. Extrapolation. We use a priori information about the signal to extrapolate (i.e., determine its values for $n < 0$ and
$n \ge N$) outside the available interval and then determine its spectrum by using the DTFT.

¹ We will ignore the quantization of discrete-time signals.

Periodic extension and windowing can be considered the simplest forms of extrapolation. It should be obvious 
that a successful extrapolation results in better spectrum estimates than periodic extension or windowing. Periodic 
extension is a straightforward application of the DFT, whereas extrapolation requires some form of a sophisticated 
signal model. As we shall see, most of the signal modeling techniques discussed in this book result in some kind of 
extrapolation. We first discuss, in the next section, the effect of spectrum sampling as imposed by the application of 
DFT (and its side effect—the periodic extension) before we provide a detailed analysis of the effect of windowing. 


4.1.3 Effect of Spectrum Sampling 


In many real-time spectrum analyzers, as illustrated in Figure 4.2, the spectrum is computed (after signal conditioning) 
by using the DFT. The computation samples the continuous spectrum at equispaced frequencies. Theoretically, if the 
number of DFT samples is greater than or equal to the frame length N , then the exact continuous spectrum (based 
on the given frame) can be obtained by using the frequency-domain reconstruction (Oppenheim and Schafer 1989; 
Proakis and Manolakis 1996). This reconstruction, which requires a periodic sinc function [defined in (4.1.9)], is not 
a practical function to implement, especially in real-time applications. Hence a simple linear interpolation is used for 
plotting or display purposes. This linear interpolation can lead to misleading results even though the computed DFT 
sample values are correct. It is possible that there may not be a DFT sample precisely at a frequency where a peak of 
the DTFT is located. In other words, the DFT spectrum misses this peak, and the resulting linearly interpolated 
spectrum provides the wrong location and height of the DTFT spectrum peak. This error can be made smaller by 
sampling the DTFT spectrum at a finer grid, that is, by increasing the size of the DFT. The denser spectrum sampling 
is implemented by an operation called zero padding and is discussed later in this section. 
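
The effect of the DFT grid on the apparent peak location can be seen in a small experiment. The sketch below is an illustration under stated assumptions: the signal (15 samples of $2\cos 0.2\pi n$), the padding factor of 10, and the function names are chosen for the example, and a direct $O(NM)$ DFT is used instead of an FFT to keep the code dependency-free.

```python
import cmath
import math


def dft(x, m):
    """m-point DFT of x, implicitly zero-padded to length m (m >= len(x))."""
    return [
        sum(x_n * cmath.exp(-2j * math.pi * k * n / m) for n, x_n in enumerate(x))
        for k in range(m)
    ]


def peak_frequency(x, m):
    """Frequency (rad/sample) of the largest DFT magnitude in 0 <= omega <= pi."""
    spectrum = dft(x, m)
    k_peak = max(range(m // 2 + 1), key=lambda k: abs(spectrum[k]))
    return 2.0 * math.pi * k_peak / m


n_frame = 15
x = [2.0 * math.cos(0.2 * math.pi * n) for n in range(n_frame)]
true_freq = 0.2 * math.pi
coarse_error = abs(peak_frequency(x, n_frame) - true_freq)       # 15-point DFT
fine_error = abs(peak_frequency(x, 10 * n_frame) - true_freq)    # zero-padded to 150
```

Zero padding does not add information about the signal; it only samples the same DTFT more densely, which is why the padded estimate lands closer to the true frequency while the underlying mainlobe width is unchanged.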
Another effect of the application of DFT for spectrum calculations is the periodic extension of the sequence in
the time domain. It follows that the $N$-point DFT

$$\tilde{X}(k) = \sum_{n=0}^{N-1} x(n)e^{-j(2\pi/N)kn} \tag{4.1.4}$$

is periodic with period $N$. This should be expected given the relationship of the DFT to the Fourier transform or the
Fourier series of discrete-time signals, which are periodic in $\omega$ with period $2\pi$. A careful look at the inverse DFT

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} \tilde{X}(k)e^{j(2\pi/N)kn} \tag{4.1.5}$$

reveals that $x(n)$ is also periodic with period $N$. This is a somewhat surprising result since no assumption about
the signal $x(n)$ outside the interval $0 \le n \le N-1$ has been made. However, this periodicity in the time domain
can be easily justified by recalling that sampling in the time domain results in a periodicity in the frequency domain,
and vice versa.
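
The implied periodic extension can be verified by evaluating the inverse DFT sum outside the original interval. In the sketch below the five-sample test sequence and the function names are arbitrary assumptions; the point is only that the synthesis formula returns the same values at $n$ and $n + N$.

```python
import cmath
import math


def dft(x):
    """N-point DFT of the sequence x."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def inverse_dft_at(spectrum, n):
    """Evaluate the inverse DFT sum at an arbitrary integer index n."""
    n_points = len(spectrum)
    total = sum(
        spectrum[k] * cmath.exp(2j * math.pi * k * n / n_points) for k in range(n_points)
    )
    return total / n_points


x = [1.0, 2.0, 0.5, -1.0, 3.0]
spectrum = dft(x)
inside = [inverse_dft_at(spectrum, n) for n in range(len(x))]
shifted = [inverse_dft_at(spectrum, n + len(x)) for n in range(len(x))]
```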

To understand these effects of spectrum sampling, consider the following example in which a continuous-time 
sinusoidal signal is sampled and then is truncated by a rectangular window before its DFT is performed. 


EXAMPLE 4.1.1. A continuous-time signal $x_c(t) = 2\cos 2\pi t$ is sampled with a sampling frequency of $F_s = 1/T = 10$ samples
per second, to obtain the sequence $x(n)$. It is windowed by an $N$-point rectangular window $w_R(n)$ to obtain the sequence
$x_N(n)$. Determine and plot $|\tilde{X}_N(k)|$, the magnitude of the DFT of $x_N(n)$, for (a) $N = 10$ and (b) $N = 15$. Comment on
the shapes of these plots.

Solution. The discrete-time signal x(n) is asampled version of x,(¢) and is given by 


s= t=T = Deve — 20080 Sen T=0.1s 


s 


Then, x(n) isa periodic sequence with fundamental period N =10. 
a. For $N = 10$, we obtain $x_N(n) = 2\cos 0.2\pi n$, $0 \le n \le 9$, which contains one period of $x(n)$. The periodic
extension of $x_N(n)$ and the magnitude plot of its DFT are shown in the top row of Figure 4.3. For comparison, the DTFT
$X_N(e^{j\omega})$ of $x_N(n)$ is also superimposed on the DFT samples. We observe that the DFT has only two nonzero
samples, which together constitute the correct frequency of the analog signal $x_c(t)$. The DTFT has a mainlobe and
several sidelobes due to the windowing effect. However, the DFT samples the sidelobes at their zero values, as illustrated
in the DFT plot. Another explanation for this behavior is that since the samples in $x_N(n)$ for $N = 10$ constitute one
full period of $\cos 0.2\pi n$, the 10-point periodic extension of $x_N(n)$, shown in the top left graph of Figure 4.3,
results in the original sinusoidal sequence $x(n)$. Thus what the DFT “sees” is the exact sampled signal $x_c(t)$. In this
case, the choice of $N$ is a desirable one.

b. For $N = 15$, we obtain $x_N(n) = 2\cos 0.2\pi n$, $0 \le n \le 14$, which contains $1\tfrac{1}{2}$ periods of $x(n)$. The
periodic extension of $x_N(n)$ and the magnitude plot of its DFT are shown in the bottom row of Figure 4.3. Once again for
comparison, the DTFT $X_N(e^{j\omega})$ of $x_N(n)$ is superimposed on the DFT samples. In this case, the DFT plot looks
markedly different from that for $N = 10$ although the DTFT plot appears to be similar. Here the DFT does not sample the
two peaks at their exact frequencies; hence if the resulting DFT samples are joined by linear interpolation, we will get
a misleading result. Since the sequence $x_N(n)$ does not contain an integer number of periods of $\cos 0.2\pi n$, the
periodic extension of $x_N(n)$ contains discontinuities at $n = lN$, $l = 0, \pm 1, \pm 2, \ldots$, as shown in the bottom left
graph of Figure 4.3. This discontinuity results in higher-order harmonics in the DFT values. The DTFT plot also has
mainlobes and sidelobes, but the DFT samples these sidelobes at nonzero values. Therefore, the length of the window is an
important consideration in spectrum estimation. The sidelobes are the source of the problem of leakage that gives rise to
bias in the spectral values, as we will see in the following section. The suppression of the sidelobes is controlled by
the window shape, which is another important consideration in spectrum estimation.
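The two cases of the example can be checked numerically. The following is a minimal sketch in Python (standard library only, not the MATLAB used elsewhere in this chapter); it assumes the sampled sequence $x(n) = x_c(nT) = 2\cos 0.2\pi n$ obtained from $x_c(t) = 2\cos 2\pi t$ with $T = 0.1$ s, and the helper names `dft`, `windowed_cosine`, and `nonzero_bins` are illustrative only.

```python
import cmath
import math

def dft(x):
    """Direct N-point DFT: X(k) = sum_n x(n) exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def windowed_cosine(N):
    """First N samples of x(n) = 2 cos(0.2*pi*n) (rectangular window)."""
    return [2.0 * math.cos(0.2 * math.pi * n) for n in range(N)]

def nonzero_bins(N, tol=1e-9):
    """Indices k whose DFT magnitude exceeds tol."""
    X = dft(windowed_cosine(N))
    return [k for k in range(N) if abs(X[k]) > tol]

bins_10 = nonzero_bins(10)   # window holds exactly one period
bins_15 = nonzero_bins(15)   # window holds one and a half periods
print("N = 10: nonzero bins", bins_10)
print("N = 15:", len(bins_15), "nonzero bins out of 15")
```

With $N = 10$ only the bins $k = 1$ and $k = 9$ survive, whereas with $N = 15$ the sinusoid falls halfway between two bins and every DFT sample picks up a sidelobe contribution.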


FIGURE 4.3
Effect of window length L on the DFT spectrum shape. (Left column: periodic extensions of the windowed sequence versus n;
right column: the corresponding DFT magnitudes versus normalized frequency.)


A quantitative description of the above interpretations and arguments related to the capacities and limitations of 
the DFT is offered by the following result (see Proakis and Manolakis 1996). 
THEOREM 4.1 (DFT SAMPLING THEOREM). Let $x_c(t)$, $-\infty < t < \infty$, be a continuous-time signal with Fourier
transform $X_c(F)$, $-\infty < F < \infty$. Then, the $N$-point sequences $\{\tilde{x}_p(n),\ 0 \le n \le N-1\}$ and
$\{\tilde{X}_p(k),\ 0 \le k \le N-1\}$ form an $N$-point DFT pair, that is,

$$\tilde{x}_p(n) \triangleq \sum_{m=-\infty}^{\infty} x_c(nT - mNT) \;\overset{\text{DFT}}{\underset{N}{\longleftrightarrow}}\; \tilde{X}_p(k) \triangleq F_s \sum_{l=-\infty}^{\infty} X_c\!\left(k\frac{F_s}{N} - lF_s\right) \tag{4.1.6}$$

where $F_s = 1/T$ is the sampling frequency.
Proof. The proof is explored in Problem 4.1.


Thus, given a continuous-time signal $x_c(t)$ and its spectrum $X_c(F)$, we can create a DFT pair by sampling and
aliasing in the time and frequency domains. This DFT pair provides a “faithful” description of $x_c(t)$ and
$X_c(F)$ only if both the time-domain aliasing and the frequency-domain aliasing are insignificant. The meaning of relation
(4.1.6) is graphically illustrated in Figure 4.4. In this figure, we show the time-domain signals in the left column and
their Fourier transforms in the right column. The top row contains continuous-time signals, which are shown as
nonperiodic and of infinite extent in both domains, since many real-world signals exhibit this behavior. The middle
row contains the sampled version of the continuous-time signal and its periodic Fourier transform (the nonperiodic
transform is shown as a dashed curve). Clearly, aliasing in the frequency domain is evident. Finally, the bottom row
shows the sampled (periodic) Fourier transform and its corresponding time-domain periodic sequence. Again, aliasing
in the time domain should be expected. Thus we have sampled and periodic signals in both domains, with the
certainty of aliasing in one domain and the possibility of aliasing in both. This figure should be recalled any time we
use the DFT for the analysis of sampled signals.
FIGURE 4.4
Graphical illustration of the DFT sampling theorem.


Zero padding


The $N$-point DFT values of an $N$-point sequence $x(n)$ are samples of the DTFT $X(e^{j\omega})$. These samples
can be used to reconstruct the DTFT $X(e^{j\omega})$ by using the periodic sinc interpolating function. Alternatively, one
can obtain more (i.e., dense) samples of the DTFT by computing a larger $N_{\text{FFT}}$-point DFT of $x(n)$, where
$N_{\text{FFT}} \gg N$. Since the number of samples of $x(n)$ is fixed, the only way we can treat $x(n)$ as an
$N_{\text{FFT}}$-point sequence is by appending $N_{\text{FFT}} - N$ zeros to it. This procedure is called the zero padding
operation, and it is used for many purposes including the augmentation of the sequence length so that a power-of-2 FFT
algorithm can be used. In spectrum estimation, zero padding is primarily used to provide a better-looking plot of the
spectrum of a finite-length sequence. This is shown in Figure 4.5 where the magnitude of an $N_{\text{FFT}}$-point DFT of the
eight-point sequence $x(n) = \cos(2\pi n/4)$ is plotted for $N_{\text{FFT}} = 8$, 16, 32, and 64. The DTFT magnitude
$|X(e^{j\omega})|$ is also shown for comparison. It can be seen that as more zeros are appended (by increasing
$N_{\text{FFT}}$), the resulting larger-point DFT provides more closely spaced samples of the DTFT, thus giving a
better-looking plot. Note, however, that the zero padding does not increase the resolution of the spectrum; that is, there
are no new peaks and valleys in the display, just a better display of the available information. This type of plot is
called a high-density spectrum. For a high-resolution spectrum, we have to collect more information by increasing $N$. The
DTFT plots shown in Figures 4.3 and 4.5 were obtained by using a very large amount of zero padding.
FIGURE 4.5
Effect of zero padding.
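The statement that zero padding only samples the same DTFT more densely can be illustrated with a short sketch. It is written in Python with the standard library only (an assumption of this illustration, not the book's MATLAB toolbox), uses the eight-point sequence $x(n) = \cos(2\pi n/4)$ from the text, and the function name `dft_zero_padded` is hypothetical.

```python
import cmath
import math

def dft_zero_padded(x, n_fft):
    """n_fft-point DFT of x after appending n_fft - len(x) zeros."""
    padded = list(x) + [0.0] * (n_fft - len(x))
    return [sum(padded[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)]

N = 8
x = [math.cos(2 * math.pi * n / 4) for n in range(N)]

X8 = dft_zero_padded(x, 8)     # no padding
X32 = dft_zero_padded(x, 32)   # 24 appended zeros

# Every 4th sample of the 32-point DFT lies on the 8-point frequency grid,
# so the padded DFT must reproduce the unpadded values there.
max_diff = max(abs(X32[4 * k] - X8[k]) for k in range(N))
print("largest difference on the shared grid:", max_diff)
```

The 32-point DFT agrees with the 8-point DFT on the shared frequencies and merely fills in the DTFT values in between, which is why the plot looks smoother without resolving anything new.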


4.1.4 Effects of Windowing: Leakage and Loss of Resolution 


To see the effect of the window on the spectrum of an arbitrary deterministic signal $x(n)$, defined over the
entire range $-\infty < n < \infty$, we notice that the available data record can be expressed as

$$x_N(n) = x(n)\,w_R(n) \tag{4.1.7}$$

where $w_R(n)$ is the rectangular window defined in (4.1.3). Thus, a finite segment of the signal can be thought of as
a product of the actual signal $x(n)$ and a data window $w(n)$. In (4.1.7), $w(n) = w_R(n)$, but $w(n)$ can be any
arbitrary finite-duration sequence. The Fourier transform of $x_N(n)$ is

$$X_N(e^{j\omega}) = X(e^{j\omega}) \circledast W(e^{j\omega}) \triangleq \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\theta})\,W(e^{j(\omega-\theta)})\,d\theta \tag{4.1.8}$$

that is, $X_N(e^{j\omega})$ equals the periodic convolution of the actual Fourier transform with the Fourier transform
$W(e^{j\omega})$ of the data window. For the rectangular window, $W(e^{j\omega}) = W_R(e^{j\omega})$, where

$$W_R(e^{j\omega}) = e^{-j\omega(N-1)/2}\,\frac{\sin(\omega N/2)}{\sin(\omega/2)} \triangleq e^{-j\omega(N-1)/2}\,A(\omega) \tag{4.1.9}$$

The function $A(\omega)$ is a periodic function in $\omega$ with fundamental period equal to $2\pi$ and is called a periodic sinc
function. Figure 4.6 shows three periods of $A(\omega)$ for $N = 11$. We note that $W_R(e^{j\omega})$ consists of a mainlobe
(ML)

$$W_{\text{ML}}(e^{j\omega}) = \begin{cases} W_R(e^{j\omega}) & |\omega| < \dfrac{2\pi}{N} \\[1ex] 0 & \dfrac{2\pi}{N} \le |\omega| \le \pi \end{cases} \tag{4.1.10}$$

and the sidelobes $W_{\text{SL}}(e^{j\omega}) = W_R(e^{j\omega}) - W_{\text{ML}}(e^{j\omega})$. Thus, (4.1.8) can be written as

$$X_N(e^{j\omega}) = X(e^{j\omega}) \circledast W_{\text{ML}}(e^{j\omega}) + X(e^{j\omega}) \circledast W_{\text{SL}}(e^{j\omega}) \tag{4.1.11}$$


The first convolution in (4.1.11) smooths rapid variations and suppresses narrow peaks in $X(e^{j\omega})$, whereas
the second convolution introduces ripples in smooth regions of $X(e^{j\omega})$ and can create “false” peaks. Therefore, the
spectrum we observe is the convolution of the actual spectrum with the Fourier transform of the data window. The
only way to improve the estimate is to increase the window length $N$ or to choose another window shape. For the
rectangular window, increasing $N$ results in a narrower mainlobe, and the distortion is reduced. As
$N \to \infty$, $W_R(e^{j\omega})$ tends to an impulse train with period $2\pi$ and $X_N(e^{j\omega})$ tends to $X(e^{j\omega})$, as
expected. Since in practice the value of $N$ is always finite, the only way to improve the estimate $X_N(e^{j\omega})$ is by
properly choosing the shape of the window $w(n)$. The only restriction on $w(n)$ is that it be of finite duration.


FIGURE 4.6
Plot of $A(\omega) = \sin(\omega N/2)/\sin(\omega/2)$ for $N = 11$.


It is known that any time-limited sequence $w(n)$ has a Fourier transform $W(e^{j\omega})$ that is nonzero except at a
finite number of frequencies. Thus, from (4.1.8) we see that the estimated value $X_N(e^{j\omega_0})$ is computed by using all
values of $X(e^{j\omega})$ weighted by $W(e^{j(\omega_0-\omega)})$. The contribution of the sinusoidal components with frequencies
$\omega \ne \omega_0$ to the value $X_N(e^{j\omega_0})$ introduces an error known as leakage. As the name suggests, energy from one
frequency range “leaks” into another, giving the wrong impression of stronger or weaker frequency components.

To illustrate the effect of the window shape and duration on the estimated spectrum, consider the signal

$$x(n) = \cos 0.35\pi n + \cos 0.4\pi n + 0.25\cos 0.8\pi n \tag{4.1.12}$$

which has a line spectrum with lines at frequencies $\omega_1 = 0.35\pi$, $\omega_2 = 0.4\pi$, and $\omega_3 = 0.8\pi$. This line spectrum
(normalized so that the magnitude is between 0 and 1) is shown in the top graph of Figure 4.7 over $0 \le \omega \le \pi$. The
spectrum $X_N(e^{j\omega})$ of $x_N(n)$ using the rectangular window is given by

$$X_N(e^{j\omega}) = \frac{1}{2}\left[W(e^{j(\omega+\omega_1)}) + W(e^{j(\omega-\omega_1)}) + W(e^{j(\omega+\omega_2)}) + W(e^{j(\omega-\omega_2)}) + 0.25\,W(e^{j(\omega+\omega_3)}) + 0.25\,W(e^{j(\omega-\omega_3)})\right] \tag{4.1.13}$$

FIGURE 4.7
Spectrum of three sinusoids using rectangular and Hamming windows.


The second and the third plots in Figure 4.7 show 2048-point DFTs of $x_N(n)$ for a rectangular data window with
$N = 21$ and $N = 81$. We note that the ability to pick out peaks (resolvability) depends on the duration $N - 1$ of
the data window (since there are $N$ samples in a data window, the number of intervals or durations is $N - 1$). To
resolve two spectral lines at $\omega = \omega_1$ and $\omega = \omega_2$ using a rectangular window, we should have the difference
$|\omega_1 - \omega_2|$ greater than the mainlobe width $\Delta\omega$, which is approximately equal to $2\pi/(N-1)$, in radians per
sampling interval, from the plot of $A(\omega)$ in Figure 4.6, that is,

$$|\omega_1 - \omega_2| > \Delta\omega \approx \frac{2\pi}{N-1} \qquad\text{or}\qquad N > \frac{2\pi}{|\omega_1 - \omega_2|} + 1$$

For a rectangular window of length $N$, the exact value of $\Delta\omega$ is equal to $1.81\pi/(N-1)$. If $N$ is too small, the
two peaks at $\omega = 0.35\pi$ and $\omega = 0.4\pi$ are fused into one, as shown in the $N = 21$ plot. When $N = 81$, the
corresponding plot shows a resolvable separation; however, the peaks have shifted somewhat from their true
locations. This is called bias, and it is a direct result of the leakage from sidelobes. In both cases, the peak at
$\omega = 0.8\pi$ can be distinguished easily (but also has a bias).

Another important observation is that the sidelobes of the data window introduce false peaks. For a rectangular
window, the peak sidelobe level is only 13 dB below the mainlobe peak (0 dB), which is not a good attenuation. Thus these
false peaks have values that are comparable to that of the true peak at $\omega = 0.8\pi$, as shown in Figure 4.7. These peaks
can be minimized by reducing the amplitudes of the sidelobes. The rectangular window cannot help in this regard because of
the well-known Gibbs phenomenon associated with it. We need a different window shape. However, any window other
than the rectangular window has a wider mainlobe; hence this reduction can be achieved only at the expense of the
resolution. To illustrate this, consider the Hamming (Hm) data window, given by

$$w_{\text{Hm}}(n) = \begin{cases} 0.54 - 0.46\cos\dfrac{2\pi n}{N-1} & 0 \le n \le N-1 \\[1ex] 0 & \text{otherwise} \end{cases} \tag{4.1.14}$$

with the approximate width of the mainlobe equal to $8\pi/(N-1)$ and the exact mainlobe width equal to
$6.27\pi/(N-1)$. The peak sidelobe level is 43 dB below zero, which is considerably better than that of the rectangular
window. The Hamming window is obtained by using the hamming(N) function in MATLAB.
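As a worked instance of the resolution condition, the sketch below computes the smallest window length for which the approximate mainlobe width drops below the separation $|\omega_1 - \omega_2| = 0.05\pi$ of the two closely spaced lines in (4.1.12). It is a Python illustration under the approximate width models quoted in the text, $2\pi/(N-1)$ for the rectangular window and $8\pi/(N-1)$ for the Hamming window; the function name `min_window_length` is hypothetical, and exact rational arithmetic is used only to keep the strict inequality unambiguous.

```python
from fractions import Fraction
import math

def min_window_length(separation_over_pi, width_numerator_over_pi):
    """Smallest integer N with width_numerator*pi/(N - 1) < separation*pi.

    Both arguments are given in units of pi, so the mainlobe-width model is
    delta_omega = width_numerator_over_pi * pi / (N - 1).
    """
    bound = Fraction(width_numerator_over_pi) / Fraction(separation_over_pi) + 1
    return math.floor(bound) + 1   # strict inequality: N > bound

# |omega_1 - omega_2| = (0.40 - 0.35) * pi
separation = Fraction(40, 100) - Fraction(35, 100)

n_rect = min_window_length(separation, 2)   # rectangular: about 2*pi/(N-1)
n_hamm = min_window_length(separation, 8)   # Hamming: about 8*pi/(N-1)
print("rectangular:", n_rect, " Hamming:", n_hamm)
```

Under these approximations the rectangular window needs $N > 41$, which is consistent with the fused peaks at $N = 21$ and the resolved peaks at $N = 81$, while the Hamming window needs roughly four times the length for the same separation.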

The bottom plot in Figure 4.7 shows the 2048-point DFT of the signal $x_N(n)$ for a Hamming window with
$N = 81$. Now the peak at $\omega = 0.8\pi$ is more prominent than before, and the sidelobes are almost suppressed. Note
also that since the mainlobe of the Hamming window is wider, the peaks have a wider base, so much so that
the first two frequencies are barely recognized. We can correct this problem by choosing a larger window length. This
interplay between the shape and the duration of a window function is one of the important issues and, as we will see
in Section 4.3, produces similar effects in the spectral analysis of random signals.


Some useful windows 


The design of windows for spectral analysis applications has drawn a lot of attention and is examined in detail in 
Harris (1978). We have already discussed two windows, namely, the rectangular and the Hamming window. Another 
useful window in spectrum analysis is due to Hann and is mistakenly known as the Hanning window. There are 
several such windows with varying degrees of tradeoff between resolution (mainlobe width) and leakage (peak 
sidelobe level). These windows are known as fixed windows since each provides a fixed amount of leakage that is 
independent of the length N . Unlike fixed windows, there are windows that contain a design parameter that can be 
used to trade between resolution and leakage. Two such windows are the Kaiser window and the Dolph-Chebyshev 
window, which are widely used in spectrum estimation. Figure 4.8 shows the time-domain window functions and
their corresponding frequency-domain log-magnitude plots in decibels for these five windows. The important
properties such as peak sidelobe level and mainlobe width of these windows are compared in Table 4.1.

















Table 4.1
Comparison of properties of commonly used windows. Each window is assumed to be of length N.

Window type        Peak sidelobe level (dB)   Approximate mainlobe width   Exact mainlobe width
Rectangular        $-13$                      $4\pi/(N-1)$                 $1.81\pi/(N-1)$
Hanning            $-32$                      $8\pi/(N-1)$                 $5.01\pi/(N-1)$
Hamming            $-43$                      $8\pi/(N-1)$                 $6.27\pi/(N-1)$
Kaiser             $-A$                       (none)                       $\dfrac{A-8}{2.285(N-1)}$
Dolph-Chebyshev    $-A$                       (none)                       $\cos^{-1}\left\{\left[\cosh\left(\dfrac{\cosh^{-1}10^{A/20}}{N-1}\right)\right]^{-1}\right\}$


Hanning window. This window is given by the function

$$w_{\text{Hn}}(n) = \begin{cases} 0.5 - 0.5\cos\dfrac{2\pi n}{N-1} & 0 \le n \le N-1 \\[1ex] 0 & \text{otherwise} \end{cases} \tag{4.1.15}$$

which is a raised cosine function. The peak sidelobe level is 32 dB below zero, and the approximate mainlobe width
is $8\pi/(N-1)$ while the exact mainlobe width is $5.01\pi/(N-1)$. In MATLAB this window function is obtained
through the function hanning(N).
FIGURE 4.8
Time-domain window functions and their frequency-domain characteristics for rectangular, Hanning, Hamming, Kaiser, and
Dolph-Chebyshev windows.
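The peak sidelobe levels quoted for the fixed windows can be estimated numerically. The sketch below is a Python illustration (standard library only) that evaluates the DTFT of each window on a dense grid, walks down the mainlobe to its first local minimum, and reports the largest magnitude beyond that point relative to the peak at $\omega = 0$. The window definitions follow (4.1.14) and (4.1.15); the length $N = 51$, the grid size, and the function names are assumptions of this example, and the measured values for a finite $N$ only approximate the nominal 13, 32, and 43 dB figures.

```python
import cmath
import math

def rectangular(N):
    return [1.0] * N

def hanning(N):
    # Raised cosine of (4.1.15), with the N - 1 denominator used in the text.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def hamming(N):
    # Hamming window of (4.1.14).
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def peak_sidelobe_db(w, grid=2048):
    """Peak sidelobe level in dB relative to the mainlobe peak at omega = 0."""
    mags = []
    for i in range(grid + 1):
        omega = math.pi * i / grid
        W = sum(w[n] * cmath.exp(-1j * omega * n) for n in range(len(w)))
        mags.append(abs(W))
    # Walk down the mainlobe until the magnitude stops decreasing (first null).
    i = 1
    while i < grid and mags[i] < mags[i - 1]:
        i += 1
    return 20 * math.log10(max(mags[i:]) / mags[0])

N = 51
levels = {
    "rectangular": peak_sidelobe_db(rectangular(N)),
    "hanning": peak_sidelobe_db(hanning(N)),
    "hamming": peak_sidelobe_db(hamming(N)),
}
for name, level in levels.items():
    print(f"{name:12s} {level:6.1f} dB")
```

The ordering of the three results reflects the tradeoff in Table 4.1: the windows with the wider mainlobes are the ones with the lower sidelobes.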


Kaiser window. This window function is due to J. F. Kaiser and is given by

$$w_K(n) = \begin{cases} \dfrac{I_0\!\left[\beta\sqrt{1 - [1 - 2n/(N-1)]^2}\,\right]}{I_0(\beta)} & 0 \le n \le N-1 \\[1ex] 0 & \text{otherwise} \end{cases} \tag{4.1.16}$$

where $I_0(\cdot)$ is the modified zero-order Bessel function of the first kind and $\beta$ is a window shape parameter that
can be chosen to obtain various peak sidelobe levels and the corresponding mainlobe widths. Clearly, $\beta = 0$ results
in the rectangular window while $\beta > 0$ results in lower sidelobe leakage at the expense of a wider mainlobe. Kaiser
has developed approximate design equations for $\beta$. Given a peak sidelobe level of $A$ dB below the peak value, the
approximate value of $\beta$ is given by

$$\beta = \begin{cases} 0 & A \le 21 \\ 0.5842(A-21)^{0.4} + 0.07886(A-21) & 21 < A \le 50 \\ 0.1102(A-8.7) & A > 50 \end{cases} \tag{4.1.17}$$

Furthermore, to achieve the given values of the peak sidelobe level of $A$ and the mainlobe width $\Delta\omega$, the length $N$
must satisfy

$$\Delta\omega = \frac{A-8}{2.285(N-1)} \tag{4.1.18}$$

In MATLAB this window is given by the function kaiser(N, beta).


Dolph-Chebyshev window. This window is characterized by the property that the peak sidelobe levels are
constant; that is, it has an “equiripple” behavior. The window $w_{\text{DC}}(n)$ is obtained as the inverse DFT of the
Chebyshev polynomial evaluated at $N$ equally spaced frequencies around the unit circle. The details of this
window function computation are available in Harris (1978). The parameters of the Dolph-Chebyshev window are
the constant sidelobe level $A$ in decibels, the window length $N$, and the mainlobe width $\Delta\omega$. However, only two
of the three parameters can be independently specified. In spectrum estimation, parameters $N$ and $A$ are
generally specified. Then $\Delta\omega$ is given by

$$\Delta\omega = \cos^{-1}\left\{\left[\cosh\left(\frac{\cosh^{-1}10^{A/20}}{N-1}\right)\right]^{-1}\right\} \tag{4.1.19}$$

In MATLAB this window is obtained through the function chebwin(N, A).
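The Kaiser design relations can be turned into a small calculator. The sketch below is a Python illustration of the shape-parameter rule (4.1.17), the length relation (4.1.18) solved for $N$, and the window (4.1.16) itself. The function names are hypothetical, the boundary cases $A = 21$ and $A = 50$ are assigned to the lower branches as an assumption, and $I_0$ is evaluated by a truncated power series rather than a library routine.

```python
import math

def kaiser_beta(A):
    """Shape parameter beta for a peak sidelobe level of A dB, per (4.1.17)."""
    if A <= 21:
        return 0.0
    if A <= 50:
        return 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21)
    return 0.1102 * (A - 8.7)

def kaiser_length(A, delta_omega):
    """Smallest N whose width (A - 8) / (2.285 (N - 1)) is <= delta_omega."""
    return math.ceil((A - 8) / (2.285 * delta_omega)) + 1

def bessel_i0(x, terms=40):
    """Modified zero-order Bessel function I0(x) from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2   # ratio of consecutive series terms
        total += term
    return total

def kaiser_window(N, beta):
    """Kaiser window of (4.1.16) for n = 0, ..., N - 1."""
    w = []
    for n in range(N):
        arg = max(0.0, 1.0 - (1.0 - 2.0 * n / (N - 1)) ** 2)
        w.append(bessel_i0(beta * math.sqrt(arg)) / bessel_i0(beta))
    return w

beta_40 = kaiser_beta(40)
n_40 = kaiser_length(40, 0.05 * math.pi)
print(round(beta_40, 4), n_40)
```

For a 40-dB sidelobe level the rule gives $\beta \approx 3.4$, and a mainlobe no wider than $0.05\pi$ then calls for a window of about 90 samples; setting $\beta = 0$ reproduces the rectangular window.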

To illustrate the usefulness of these windows, consider the same signal containing three frequencies given in
(4.1.12). Figure 4.9 shows the spectrum of $x_N(n)$ using the Hanning, Kaiser, and Chebyshev windows for length
$N = 81$. The Kaiser and Chebyshev window parameters are adjusted so that the peak sidelobe level is 40 dB or
below. Clearly, these windows have suppressed the sidelobes considerably compared to the rectangular window,
but the main peaks are wider, with negligible bias. The two peaks in the Hanning window spectrum are barely
resolved because the mainlobe width of this window is much wider than that of the rectangular window. The
Chebyshev window spectrum has uniform sidelobes while the Kaiser window spectrum shows decreasing sidelobes
away from the mainlobes.

FIGURE 4.9
Spectrum of three sinusoids using Hanning, Kaiser, and Chebyshev windows.


4.1.5 Summary 


In conclusion, the frequency analysis of deterministic signals requires a careful study of three important steps. First,
the continuous-time signal $x_c(t)$ is sampled to obtain samples $x(n)$ that are collected into blocks or frames. The
frames are “conditioned” to minimize certain errors by multiplying by a window sequence $w(n)$ of length $N$.
Finally the windowed frames $x_N(n)$ are transformed to the frequency domain using the DFT. The resulting DFT
spectrum $X_N(k)$ is a faithful replica of the actual spectrum $X_c(F)$ if the following errors are sufficiently small.

Aliasing error. This is an error due to the sampling operation. If the sampling rate is sufficiently high and if the 
antialiasing filter is properly designed so that most of the frequencies of interest are represented in x(n), then this 
error can be made smaller. However, a certain amount of aliasing should be expected. 

Errors due to finite-length window. There are several errors such as resolution loss, bias, and leakage that are 
attributed to the windowing operation. Therefore, a careful design of the window function and its length is necessary 
to minimize these errors. These topics were discussed in Section 4.1.4. In Table 4.1 we summarize key properties of 
five windows discussed in this section that are useful for spectrum estimation. 

Spectrum reconstruction error. The DFT spectrum $X_N(k)$ is a number sequence that must be reconstructed
into a continuous function for the purpose of plotting. A practical choice for this reconstruction is the first-order
polynomial interpolation. This reconstruction error can be made smaller (and in fact comparable to the screen
resolution) by choosing a large number of frequency samples, which can be achieved by the zero padding operation
in the DFT. It was discussed in Section 4.1.3.

With the understanding of frequency analysis concepts developed in this section, we are now ready to tackle the 
problem of spectral analysis of stationary random signals. From Chapter 2 we recognize that the true spectral values 
can only be obtained as estimates. This requires some understanding of key concepts from estimation theory, which is 
developed in Section 2.4. 


4.2 Estimation of the Autocorrelation of Stationary 
Random Signals 


The second-order moments of a stationary random sequence (that is, the mean value $\mu_x$, the autocorrelation
sequence $r_x(l)$, and the PSD $R_x(e^{j\omega})$) play a crucial role in signal analysis and signal modeling. In this section,
we discuss the estimation of the autocorrelation sequence $r_x(l)$ using a finite data record $\{x(n)\}_0^{N-1}$ of the process.

For a stationary process $x(n)$, the most widely used estimator of $r_x(l)$ is given by the sample autocorrelation
sequence

$$\hat{r}_x(l) \triangleq \begin{cases} \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) & 0 \le l \le N-1 \\[2ex] \hat{r}_x^*(-l) & -(N-1) \le l < 0 \\[1ex] 0 & \text{elsewhere} \end{cases} \tag{4.2.1}$$

or, equivalently,

$$\hat{r}_x(l) \triangleq \begin{cases} \dfrac{1}{N}\displaystyle\sum_{n=l}^{N-1} x(n)\,x^*(n-l) & 0 \le l \le N-1 \\[2ex] \hat{r}_x^*(-l) & -(N-1) \le l < 0 \\[1ex] 0 & \text{elsewhere} \end{cases} \tag{4.2.2}$$

which is a random sequence. Note that without further information beyond the observed data $\{x(n)\}_0^{N-1}$, it is not
possible to provide reasonable estimates of $r_x(l)$ for $|l| \ge N$. Even for lag values $|l|$ close to $N$, the correlation
estimates are unreliable since very few $x(n+|l|)\,x(n)$ pairs are used. A good rule of thumb provided by Box and
Jenkins (1976) is that $N$ should be at least 50 and that $|l| \le N/4$. The sample autocorrelation $\hat{r}_x(l)$ given in
(4.2.1) has a desirable property that for each $l \ge 1$, the sample autocorrelation matrix

$$\hat{\mathbf{R}}_x = \begin{bmatrix} \hat{r}_x(0) & \hat{r}_x^*(1) & \cdots & \hat{r}_x^*(N-1) \\ \hat{r}_x(1) & \hat{r}_x(0) & \cdots & \hat{r}_x^*(N-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{r}_x(N-1) & \hat{r}_x(N-2) & \cdots & \hat{r}_x(0) \end{bmatrix} \tag{4.2.3}$$

is nonnegative definite (see Section 2.3.1). This property is explored in Problem 4.5. MATLAB provides functions to
compute the correlation matrix $\hat{\mathbf{R}}_x$ (for example, corr), given the data $\{x(n)\}_0^{N-1}$; however, the book toolbox
function rx = autoc(x, L); computes $\hat{r}_x(l)$ according to (4.2.1) very efficiently.
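A direct, unoptimized rendering of (4.2.1) may help fix the indexing. The following Python sketch is an illustration only and is not the book toolbox function autoc; the function name `sample_autocorrelation` and its dictionary return type are assumptions of this example.

```python
def sample_autocorrelation(x, max_lag):
    """Biased sample autocorrelation of (4.2.1) for lags -max_lag..max_lag.

    Returns a dict {l: r_hat(l)}; requires 0 <= max_lag <= len(x) - 1.
    """
    N = len(x)
    r = {}
    for l in range(max_lag + 1):
        # (1/N) * sum_{n=0}^{N-l-1} x(n+l) x*(n)
        acc = sum(x[n + l] * complex(x[n]).conjugate() for n in range(N - l))
        r[l] = acc / N
        r[-l] = r[l].conjugate()   # r_hat(-l) = r_hat*(l)
    return r

r = sample_autocorrelation([1.0, 2.0, 3.0, 4.0], max_lag=3)
print([r[l].real for l in range(4)])   # [7.5, 5.0, 2.75, 1.0]
```

Note that every lag is divided by $N$, not by the number $N - l$ of products in the sum, which is what makes the estimate biased but keeps the matrix in (4.2.3) nonnegative definite.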

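The estimate of the covariance $\gamma_x(l)$ from the data record $\{x(n)\}_0^{N-1}$ is given by the sample autocovariance
sequence

$$\hat{\gamma}_x(l) = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-l-1} [x(n+l) - \hat{\mu}_x][x^*(n) - \hat{\mu}_x^*] & 0 \le l \le N-1 \\[2ex] \hat{\gamma}_x^*(-l) & -(N-1) \le l < 0 \\[1ex] 0 & \text{elsewhere} \end{cases} \tag{4.2.4}$$

so that the corresponding autocovariance matrix $\hat{\boldsymbol{\Gamma}}_x$ is nonnegative definite. Similarly, the sample autocorrelation
coefficient sequence $\hat{\rho}_x(l)$ is given by

$$\hat{\rho}_x(l) = \frac{\hat{\gamma}_x(l)}{\hat{\sigma}_x^2} \tag{4.2.5}$$

In the rest of this section, we assume that $x(n)$ is a zero-mean process and hence $\hat{r}_x(l) = \hat{\gamma}_x(l)$, so that we can
discuss the autocorrelation estimate in detail.
To determine the statistical quality of this estimator, we now consider its mean and variance.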
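Mean of $\hat{r}_x(l)$. We first note that (4.2.1) can be written as

$$\hat{r}_x(l) = \frac{1}{N}\sum_{n=-\infty}^{\infty} x(n+l)\,w(n+l)\,x^*(n)\,w(n) \qquad |l| \ge 0 \tag{4.2.6}$$

where

$$w(n) = w_R(n) = \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{elsewhere} \end{cases} \tag{4.2.7}$$

is the rectangular window. The expected value of $\hat{r}_x(l)$ is

$$E\{\hat{r}_x(l)\} = \frac{1}{N}\sum_{n=-\infty}^{\infty} E\{x(n+l)\,x^*(n)\}\,w(n+l)\,w(n) \qquad l \ge 0$$

and

$$E\{\hat{r}_x(-l)\} = E\{\hat{r}_x^*(l)\} \qquad -l \le 0$$

Therefore

$$E\{\hat{r}_x(l)\} = \frac{1}{N}\,r_x(l)\,r_w(l) \tag{4.2.8}$$

where

$$r_w(l) \triangleq w(l) * w(-l) = \sum_{n=-\infty}^{\infty} w(n)\,w(n+l) \tag{4.2.9}$$

is the autocorrelation of the window sequence. For the rectangular window

$$r_w(l) = w_B(l) \triangleq \begin{cases} N - |l| & |l| \le N-1 \\ 0 & \text{elsewhere} \end{cases} \tag{4.2.10}$$

which is the unnormalized triangular or Bartlett window. Thus

$$E\{\hat{r}_x(l)\} = \frac{1}{N}\,r_x(l)\,w_B(l) = r_x(l)\left(1 - \frac{|l|}{N}\right) \qquad |l| \le N-1 \tag{4.2.11}$$

Therefore, we conclude that the relation (4.2.1) provides a biased estimate of $r_x(l)$ because the expected value of
$\hat{r}_x(l)$ from (4.2.11) is not equal to the true autocorrelation $r_x(l)$. However, $\hat{r}_x(l)$ is an asymptotically unbiased
estimator since if $N \to \infty$, $E\{\hat{r}_x(l)\} \to r_x(l)$. Clearly, the bias is small if $\hat{r}_x(l)$ is evaluated for $|l| \le L$, where $L$
is the maximum desired lag and $L \ll N$.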
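The size of this bias is easy to tabulate. The Python sketch below evaluates the window autocorrelation (4.2.9) by direct summation for a rectangular window and forms the factor $r_w(l)/N = 1 - |l|/N$ that multiplies $r_x(l)$ in the expected value; the function names and the choice $N = 100$ are assumptions made for illustration.

```python
def window_autocorrelation(w, l):
    """r_w(l) = sum_n w(n) w(n + l) for a finite-length real window w."""
    N = len(w)
    l = abs(l)
    if l >= N:
        return 0.0
    return sum(w[n] * w[n + l] for n in range(N - l))

def expected_scale(N, l):
    """Factor multiplying r_x(l) in E{r_hat_x(l)} for the rectangular window."""
    return window_autocorrelation([1.0] * N, l) / N

N = 100
for l in (0, 5, 25, 90):
    print(l, expected_scale(N, l))
```

At the rule-of-thumb limit $|l| = N/4$ the expected value is already scaled down to 75 percent of $r_x(l)$, and at $l = 90$ only 10 percent remains, which is why the lags of interest should satisfy $L \ll N$.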
Variance of $\hat{r}_x(l)$. An approximate expression for the covariance of $\hat{r}_x(l)$ is given by Jenkins and Watts
(1968)

$$\operatorname{cov}\{\hat{r}_x(l_1), \hat{r}_x(l_2)\} \simeq \frac{1}{N}\sum_{l=-\infty}^{\infty}\left[r_x(l)\,r_x(l + l_2 - l_1) + r_x(l + l_2)\,r_x(l - l_1)\right] \tag{4.2.12}$$

This indicates that successive values of $\hat{r}_x(l)$ may be highly correlated and that $\hat{r}_x(l)$ may fail to die out even if it
is expected to. This makes the interpretation of autocorrelation graphs quite challenging because we do not know
whether the variation is real or statistical.

The variance of $\hat{r}_x(l)$, which can be obtained by setting $l_1 = l_2$ in (4.2.12), tends to zero as $N \to \infty$. Thus,
$\hat{r}_x(l)$ provides a good estimate of $r_x(l)$ if the lag $|l|$ is much smaller than $N$. However, as $|l|$ approaches $N$,
fewer and fewer samples of $x(n)$ are used to evaluate $\hat{r}_x(l)$. As a result, the estimate $\hat{r}_x(l)$ becomes worse and
its variance increases.

Nonnegative definiteness of $\hat{r}_x(l)$. An alternative estimator for the autocorrelation sequence is given by

$$\check{r}_x(l) = \begin{cases} \dfrac{1}{N-l}\displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) & 0 \le l \le L < N \\[2ex] \check{r}_x^*(-l) & -N < -L \le l < 0 \\[1ex] 0 & \text{elsewhere} \end{cases} \tag{4.2.13}$$

Although this estimator is unbiased, it is not used in spectral estimation because it is not guaranteed to be
nonnegative definite. In contrast, the estimator $\hat{r}_x(l)$ from (4.2.1) is nonnegative definite, and any spectral estimates
based on it do not have any negative values. Furthermore, the estimator $\hat{r}_x(l)$ has smaller variance and mean square
error than the estimator $\check{r}_x(l)$ (Jenkins and Watts 1968). Thus, in this book we use the estimator $\hat{r}_x(l)$ defined in
(4.2.1).
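A four-sample record is enough to show the loss of nonnegative definiteness. The Python sketch below computes both estimators for the real data $x = \{1, -1, 1, -1\}$ and evaluates the Fourier transform of each estimated sequence at $\omega = 0$; the data record and the function names are assumptions chosen for illustration.

```python
import math

def autocorrelation(x, unbiased):
    """Lags 0..N-1 of the estimators (4.2.1) or (4.2.13) for real data."""
    N = len(x)
    r = []
    for l in range(N):
        acc = sum(x[n + l] * x[n] for n in range(N - l))
        r.append(acc / (N - l) if unbiased else acc / N)
    return r

def spectrum(r, omega):
    """Sum over |l| <= N-1 of r(l) e^{-j omega l} for a real, even r(l)."""
    return r[0] + 2.0 * sum(r[l] * math.cos(omega * l) for l in range(1, len(r)))

x = [1.0, -1.0, 1.0, -1.0]
biased_at_0 = spectrum(autocorrelation(x, unbiased=False), 0.0)
unbiased_at_0 = spectrum(autocorrelation(x, unbiased=True), 0.0)
print(biased_at_0, unbiased_at_0)
```

The transform of the biased estimate equals $|X(e^{j\omega})|^2/N$ and therefore never goes negative, whereas the unbiased estimate yields a negative value at $\omega = 0$ for this record.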


4.3 Estimation of the Power Spectrum of Stationary 
Random Signals 


From a practical point of view, most stationary random processes have continuous spectra. However, harmonic 
processes (i.e., processes with line spectra) appear in several applications either alone or in mixed spectra (a mixture 
of continuous and line spectra). We first discuss the estimation of continuous spectra in detail. The estimation of line 
spectra is considered in Chapter 8. 

The power spectral density of a zero-mean stationary stochastic process was defined in (2.1.44) as

$$R_x(e^{j\omega}) \triangleq \sum_{l=-\infty}^{\infty} r_x(l)\,e^{-j\omega l} \tag{4.3.1}$$

assuming that the autocorrelation sequence $r_x(l)$ is absolutely summable. We will deal with the problem of
estimating the power spectrum $R_x(e^{j\omega})$ of a stationary process $x(n)$ from a finite record of observations
$\{x(n)\}_0^{N-1}$ of a single realization. The ideal goal is to devise an estimate that will faithfully characterize the
power-versus-frequency distribution of the stochastic process (i.e., all the sequences of the ensemble) using only a
segment of a single realization. For this to be possible, the estimate should typically involve some kind of averaging
among several realizations or along a single realization.

In some practical applications (e.g., interferometry), it is possible to directly measure the autocorrelation $r_x(l)$,
$|l| \le L < N$, with great accuracy. In this case, the spectrum estimation problem can be treated as a deterministic one,
as described in Section 4.1. We will focus on the “stochastic” version of the problem, where $R_x(e^{j\omega})$ is estimated
from the available data $\{x(n)\}_0^{N-1}$. A natural estimate of $R_x(e^{j\omega})$, suggested by (4.3.1), is to estimate $r_x(l)$ from
the available data and then transform it by using (4.3.1).


4.3.1 Power Spectrum Estimation Using the Periodogram 


The periodogram is an estimator of the power spectrum, introduced by Schuster (1898) in his efforts to search for
hidden periodicities in solar sunspot data. The periodogram of the data segment $\{x(n)\}_0^{N-1}$ is defined by

$$\hat{R}_x(e^{j\omega}) \triangleq \frac{1}{N}\left|\sum_{n=0}^{N-1} v(n)\,e^{-j\omega n}\right|^2 = \frac{1}{N}\,|V(e^{j\omega})|^2 \tag{4.3.2}$$

where $V(e^{j\omega})$ is the DTFT of the windowed sequence

$$v(n) = x(n)\,w(n) \qquad 0 \le n \le N-1 \tag{4.3.3}$$

The above definition of the periodogram stems from Parseval’s relation on the power of a signal. The window $w(n)$,
which has length $N$, is known as the data window. Usually, the term periodogram is used when $w(n)$ is a
rectangular window. In contrast, the term modified periodogram is used to stress the use of nonrectangular windows.
The values of the periodogram at the discrete set of frequencies $\{\omega_k = 2\pi k/N\}_0^{N-1}$ can be calculated by

$$\tilde{R}_x(k) \triangleq \hat{R}_x(e^{j2\pi k/N}) = \frac{1}{N}\,|\tilde{V}(k)|^2 \qquad k = 0, 1, \ldots, N-1 \tag{4.3.4}$$

where $\tilde{V}(k)$ is the $N$-point DFT of the windowed segment $v(n)$. In MATLAB, the modified periodogram
computation is implemented by using the function
Rx = psd(x, Nfft, Fs, window(N), 'none');

where window is the name of any MATLAB-provided window function (e.g., hamming); Nfft is the size of the
DFT, which is chosen to be larger than N to obtain a high-density spectrum (see zero padding in Section 4.1.3);
and Fs is the sampling frequency, which is used for plotting purposes. If the window boxcar is used, then we
obtain the periodogram estimate.

The periodogram can be expressed in terms of the autocorrelation estimate $\hat{r}_v(l)$ of the windowed sequence
$v(n)$ as (see Problem 4.9)

$$\hat{R}_x(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_v(l)\,e^{-j\omega l} \tag{4.3.5}$$

which shows that $\hat{R}_x(e^{j\omega})$ is a “natural” estimate of the power spectrum. From (4.3.2) it follows that $\hat{R}_x(e^{j\omega})$ is
nonnegative for all frequencies $\omega$. This results from the fact that the autocorrelation sequence $\hat{r}_v(l)$,
$0 \le |l| \le N-1$, is nonnegative definite. If we use the estimate $\check{r}_x(l)$ from (4.2.13) in (4.3.5) instead of $\hat{r}_x(l)$,
the obtained periodogram may assume negative values, which implies that $\check{r}_x(l)$ is not guaranteed to be
nonnegative definite.
The inverse Fourier transform of $\hat{R}_x(e^{j\omega})$ provides the estimated autocorrelation $\hat{r}_v(l)$, that is,

$$\hat{r}_v(l) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j\omega})\,e^{j\omega l}\,d\omega \tag{4.3.6}$$

because $\hat{r}_v(l)$ and $\hat{R}_x(e^{j\omega})$ form a DTFT pair. Using (4.3.6) and (4.2.1) for $l = 0$, we have

$$\hat{r}_v(0) = \frac{1}{N}\sum_{n=0}^{N-1} |v(n)|^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j\omega})\,d\omega \tag{4.3.7}$$

Thus, the periodogram $\hat{R}_x(e^{j\omega})$ shows how the power of the segment $\{v(n)\}_0^{N-1}$, which provides an estimate of the
variance of the process $x(n)$, is distributed as a function of frequency.
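Both relations can be verified numerically on a short record. The Python sketch below compares the direct definition (4.3.2) with the transform of the sample autocorrelation in (4.3.5), and checks (4.3.7) by averaging the periodogram over a uniform frequency grid (the average is exact here because the periodogram is a trigonometric polynomial of degree $N - 1$, smaller than the grid size). The data values, the grid of 64 frequencies, and the function names are assumptions made for illustration.

```python
import cmath
import math

def periodogram(v, omega):
    """(1/N) |V(e^{j omega})|^2 for a windowed segment v(n), n = 0..N-1."""
    N = len(v)
    V = sum(v[n] * cmath.exp(-1j * omega * n) for n in range(N))
    return abs(V) ** 2 / N

def correlogram(v, omega):
    """Sum over |l| <= N-1 of r_hat_v(l) e^{-j omega l} for real v(n)."""
    N = len(v)
    r = [sum(v[n + l] * v[n] for n in range(N - l)) / N for l in range(N)]
    return r[0] + 2.0 * sum(r[l] * math.cos(omega * l) for l in range(1, N))

v = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]
omegas = [2 * math.pi * k / 64 for k in range(64)]

# (4.3.5): both routes to the periodogram agree at every frequency.
max_gap = max(abs(periodogram(v, w) - correlogram(v, w)) for w in omegas)

# (4.3.7): the mean level of the periodogram equals r_hat_v(0).
mean_level = sum(periodogram(v, w) for w in omegas) / len(omegas)
power = sum(s * s for s in v) / len(v)
print(max_gap, mean_level, power)
```

The agreement of the last two printed numbers is the discrete counterpart of the statement that the periodogram distributes the segment power over frequency.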


Filter bank interpretation. The above assertion that the periodogram describes a distribution of power as a
function of frequency can be interpreted in a different way, in which the power estimate over a narrow frequency
band is attributed to the output power of a narrow-bandpass filter. This leads to the well-known filter bank
interpretation of the periodogram. To develop this interpretation, consider the basic (unwindowed) periodogram
estimator $\hat{R}_x(e^{j\omega})$ in (4.3.2), evaluated at a frequency $\omega_k \triangleq k\,\Delta\omega \triangleq 2\pi k/N$, which can be expressed as

$$\hat{R}_x(e^{j\omega_k}) = \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\,e^{-j\omega_k n}\right|^2 = \frac{1}{N}\left|\sum_{n=0}^{N-1} x(n)\,e^{j\omega_k (N-n)}\right|^2 = N\left|\sum_{n=0}^{N-1} x(n)\,\frac{1}{N}\,e^{j\omega_k (N-n)}\right|^2 \qquad \text{since } \omega_k N = 2\pi k \tag{4.3.8}$$

Clearly, the term inside the absolute value sign in (4.3.8) can be interpreted as a convolution of $x(n)$ and $e^{j\omega_k n}$,
evaluated at $n = N - 1$ (up to the unit-magnitude factor $e^{j\omega_k}$, which does not affect the absolute value). Define

$$h_k(n) \triangleq \begin{cases} \dfrac{1}{N}\,e^{j\omega_k n} & 0 \le n \le N-1 \\[1ex] 0 & \text{otherwise} \end{cases} \tag{4.3.9}$$

as the impulse response of a linear system whose frequency response is given by

$$\begin{aligned} H_k(e^{j\omega}) &= \mathcal{F}[h_k(n)] = \frac{1}{N}\sum_{n=0}^{N-1} e^{j\omega_k n}\,e^{-j\omega n} \\ &= \frac{1}{N}\sum_{n=0}^{N-1} e^{-j(\omega-\omega_k)n} = \frac{1}{N}\,\frac{e^{-jN(\omega-\omega_k)} - 1}{e^{-j(\omega-\omega_k)} - 1} \\ &= \frac{1}{N}\,\frac{\sin[N(\omega-\omega_k)/2]}{\sin[(\omega-\omega_k)/2]}\,e^{-j(N-1)(\omega-\omega_k)/2} \end{aligned} \tag{4.3.10}$$

which is a linear-phase, narrow-bandpass filter centered at $\omega = \omega_k$. The 3-dB bandwidth of this filter is
proportional to $2\pi/N$ rad per sampling interval (or $1/N$ cycles per sampling interval). A plot of the magnitude
response $|H_k(e^{j\omega})|$, for $\omega_k = \pi/2$ and $N = 50$, is shown in Figure 4.10, which evidently shows the narrowband
nature of the filter.

FIGURE 4.10
The magnitude of the frequency response of the narrow-bandpass filter for $\omega_k = \pi/2$ and $N = 50$. (Panel title: filter
response, $\omega_k = \pi/2$, $N = 50$; vertical axis: power in dB.)

Continuing, we also define the output of the filter $h_k(n)$ by $y_k(n)$, that is,

$$y_k(n) \triangleq h_k(n) * x(n) = \frac{1}{N}\sum_{m=0}^{N-1} x(n-m)\,e^{j\omega_k m} \tag{4.3.11}$$

Then (4.3.8) can be written as

$$\hat{R}_x(e^{j\omega_k}) = N\,|y_k(N-1)|^2 \tag{4.3.12}$$

Now consider the average power in $y_k(n)$, which can be evaluated using the spectral density as

$$E\{|y_k(n)|^2\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\omega})\,|H_k(e^{j\omega})|^2\,d\omega \approx \frac{\Delta\omega}{2\pi}\,R_x(e^{j\omega_k}) = \frac{1}{N}\,R_x(e^{j\omega_k}) \tag{4.3.13}$$

since $H_k(e^{j\omega})$ is a narrowband filter. If we estimate the average power $E\{|y_k(n)|^2\}$ using one sample $y_k(N-1)$,
then from (4.3.13) the estimated spectral density is the periodogram given by (4.3.12), which says that the $k$th DFT
sample of the periodogram [see (4.3.4)] is given by the average power of a single $(N-1)$st output sample of the
$\omega_k$-centered narrow-bandpass filter. Now imagine one such filter for each of the frequencies $\omega_k$, $k = 0, \ldots, N-1$. Thus
we have a bank of filters, each tuned to a discrete frequency (based on the data record length), providing the
periodogram estimates every $N$ samples. This filter bank is inherently built into the periodogram and hence need
not be explicitly implemented. The block diagram of this filter bank approach to the periodogram computation is
shown in Figure 4.11.




FIGURE 4.11
The filter bank approach to the periodogram computation.
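
The identity behind (4.3.8) through (4.3.12) can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, the test signal is an arbitrary white Gaussian record, and the direct sums are written for clarity rather than speed. It compares the $k$th periodogram sample with $N|y_k(N-1)|^2$ computed from the filter $h_k(n)$ of (4.3.9).

```python
import cmath
import random

def periodogram_dft_sample(x, k):
    """(1/N)|sum_n x(n) exp(-j w_k n)|^2 with w_k = 2*pi*k/N."""
    N = len(x)
    wk = 2 * cmath.pi * k / N
    X = sum(x[n] * cmath.exp(-1j * wk * n) for n in range(N))
    return abs(X) ** 2 / N

def filter_bank_sample(x, k):
    """N|y_k(N-1)|^2, where y_k = h_k * x and h_k(n) = exp(j w_k n)/N for 0 <= n <= N-1."""
    N = len(x)
    wk = 2 * cmath.pi * k / N
    h = [cmath.exp(1j * wk * n) / N for n in range(N)]
    # Convolution output at time N-1: sum over m of h(m) x(N-1-m)
    y_last = sum(h[m] * x[N - 1 - m] for m in range(N))
    return N * abs(y_last) ** 2

random.seed(0)  # arbitrary seed, for repeatability only
x = [random.gauss(0.0, 1.0) for _ in range(16)]
for k in (0, 3, 8):
    print(k, periodogram_dft_sample(x, k), filter_bank_sample(x, k))
```

The two printed columns should agree to within rounding error for every $k$, which is the filter bank interpretation in numerical form.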


In Section 4.1, we observed that the periodogram of a deterministic signal approaches the true energy spectrum
as the number of observations $N \to \infty$. To see how the power spectrum of random signals is related to the number
of observations, we consider the following example.


EXAMPLE 4.3.1 (PERIODOGRAM OF A SIMULATED WHITE NOISE SEQUENCE). Let $x(n)$ be a stationary white
Gaussian noise with zero mean and unit variance. The theoretical spectrum of $x(n)$ is

$$R_x(e^{j\omega}) = \sigma_x^2 = 1 \qquad -\pi < \omega \le \pi$$

To study the periodogram estimate, 50 different $N$-point records of $x(n)$ were generated using a pseudorandom number generator.
The periodogram $\hat{R}_x(e^{j\omega})$ of each record was computed for $\omega = \omega_k = 2\pi k/1024$, $k = 0, 1, \ldots, 512$, that is, with
$N_{FFT} = 1024$, from the available data using (4.3.4) for $N = 32$, 128, and 256. These results in the form of periodogram
overlays (a Monte Carlo simulation) and their averages are shown in Figure 4.12. We notice that $\hat{R}_x(e^{j\omega})$ fluctuates so erratically
that it is impossible to conclude from its observation that the signal has a flat spectrum. Furthermore, the size of the fluctuations (as
seen from the ensemble average) is not reduced by increasing the segment length $N$. In this sense, we should not expect the
periodogram $\hat{R}_x(e^{j\omega})$ to converge to the true spectrum $R_x(e^{j\omega})$ in some statistical sense as $N \to \infty$. Since $R_x(e^{j\omega})$ is
constant over frequency, the fluctuations of $\hat{R}_x(e^{j\omega})$ can be characterized by their mean, variance, and mean square error over
frequency for each $N$ and are given in Table 4.2. It can be seen that although the mean value tends to 1 (true value), the standard
deviation is not reduced as $N$ increases. In fact, it is close to 1; that is, it is of the order of the size of the quantity to be estimated.
This illustrates that the periodogram is not a good estimate of the power spectrum.


Since for each value of $\omega$, $\hat{R}_x(e^{j\omega})$ is a random variable, the erratic behavior of the periodogram estimator,
which is illustrated in Figure 4.12, can be explained by considering its mean, covariance, and variance.


Table 4.2
Performance of periodogram for white Gaussian noise signal in Example 4.3.1.

N                                   32        128       256
$E[\hat{R}_x(e^{j\omega_k})]$       0.7829    0.8954    0.9963
$var[\hat{R}_x(e^{j\omega_k})]$     0.7232    1.0635    1.1762
MSE                                 0.7689    1.07244   1.1739
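
A small Monte Carlo experiment in the spirit of Example 4.3.1 can be sketched as follows. This is a simplified illustration, not the simulation behind Figure 4.12 or Table 4.2: it assumes Python's standard pseudorandom generator, evaluates the periodogram only at the $N$ DFT frequencies (rather than on a 1024-point grid), uses 30 records, and pools the values over frequency and records. The function names are hypothetical.

```python
import cmath
import random

def periodogram(x):
    """Basic periodogram (1/N)|X(k)|^2 at the N DFT frequencies (direct O(N^2) DFT)."""
    N = len(x)
    tw = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]  # twiddle-factor table
    return [abs(sum(x[n] * tw[(k * n) % N] for n in range(N))) ** 2 / N
            for k in range(N)]

def periodogram_stats(N, records=30, seed=0):
    """Mean and variance of periodogram values pooled over frequency and records."""
    rng = random.Random(seed)
    vals = []
    for _ in range(records):
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]  # unit-variance white Gaussian noise
        vals.extend(periodogram(x))
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var

for N in (32, 64, 128):
    mean, var = periodogram_stats(N)
    print(f"N={N:4d}  mean={mean:.3f}  variance={var:.3f}")
```

If the sketch behaves as the theory predicts, the pooled mean stays near 1 while the pooled variance does not shrink as $N$ grows.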


Mean of $\hat{R}_x(e^{j\omega})$. Taking the mathematical expectation of (4.3.5) and using (4.2.8), we obtain

$$E\{\hat{R}_x(e^{j\omega})\} = \sum_{l=-(N-1)}^{N-1} E\{\hat{r}_x(l)\}\,e^{-j\omega l} = \frac{1}{N}\sum_{l=-(N-1)}^{N-1} r_x(l)\,r_w(l)\,e^{-j\omega l} \tag{4.3.14}$$

FIGURE 4.12
Periodograms of white Gaussian noise in Example 4.3.1 (periodogram overlays and periodogram averages for N = 32, 128, and 256).


Since $E\{\hat{R}_x(e^{j\omega})\} \ne R_x(e^{j\omega})$, the periodogram is a biased estimate of the true power spectrum $R_x(e^{j\omega})$.
Equation (4.3.14) can be interpreted in the frequency domain as a periodic convolution. Indeed, using the
frequency domain convolution theorem, we have

$$E\{\hat{R}_x(e^{j\omega})\} = \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,R_w(e^{j(\omega-\theta)})\,d\theta \tag{4.3.15}$$

where

$$R_w(e^{j\omega}) = |W(e^{j\omega})|^2 \tag{4.3.16}$$

is the spectrum of the window. Thus, the expected value of the periodogram is obtained by convolving the true
spectrum $R_x(e^{j\omega})$ with the spectrum $R_w(e^{j\omega})$ of the window. This is equivalent to windowing the true
autocorrelation $r_x(l)$ with the correlation or lag window $r_w(l) = w(l) * w(-l)$, where $w(n)$ is the data window.

To understand the implications of (4.3.15), consider the rectangular data window (4.2.7). Using (4.2.11), we see
that (4.3.14) becomes

$$E\{\hat{R}_x(e^{j\omega})\} = \sum_{l=-(N-1)}^{N-1}\left(1 - \frac{|l|}{N}\right) r_x(l)\,e^{-j\omega l} \tag{4.3.17}$$

For nonperiodic autocorrelations, the value of $r_x(l)$ becomes negligible for large values of $|l|$. Hence, as the
record length $N$ increases, the term $(1 - |l|/N) \to 1$ for all $l$, which implies that

$$\lim_{N\to\infty} E\{\hat{R}_x(e^{j\omega})\} = R_x(e^{j\omega}) \tag{4.3.18}$$

that is, the periodogram is an asymptotically unbiased estimator of $R_x(e^{j\omega})$. In the frequency domain, we obtain

$$R_w(e^{j\omega}) = \mathcal{F}\{w_R(l) * w_R(-l)\} = |W_R(e^{j\omega})|^2 = \left[\frac{\sin(\omega N/2)}{\sin(\omega/2)}\right]^2 \tag{4.3.19}$$

where

$$W_R(e^{j\omega}) = e^{-j\omega(N-1)/2}\,\frac{\sin(\omega N/2)}{\sin(\omega/2)} \tag{4.3.20}$$

is the Fourier transform of the rectangular window. The spectrum $R_w(e^{j\omega})$, in (4.3.19), of the correlation window
$r_w(l)$ approaches a periodic impulse train as the window length increases.† As a result, $E\{\hat{R}_x(e^{j\omega})\}$ approaches
the true power spectrum $R_x(e^{j\omega})$ as $N$ approaches $\infty$.

The result (4.3.18) holds for any window that satisfies the following two conditions:
1. The window is normalized such that

$$\sum_{n=0}^{N-1} w^2(n) = N \tag{4.3.21}$$

This condition is obtained by noting that, for asymptotic unbiasedness, we want $R_w(e^{j\omega})/N$ in (4.3.15) to be an
approximation of an impulse in the frequency domain. Since the area under the impulse function is unity, using
(4.3.16) and Parseval's theorem, we have

$$\frac{1}{2\pi N}\int_{-\pi}^{\pi} |W(e^{j\omega})|^2\,d\omega = \frac{1}{N}\sum_{n=0}^{N-1} w^2(n) = 1 \tag{4.3.22}$$

2. The width of the mainlobe of the spectrum $R_w(e^{j\omega})$ of the correlation window decreases as $1/N$. This condition
guarantees that the area under $R_w(e^{j\omega})$ is concentrated at the origin as $N$ becomes large.
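
The asymptotic unbiasedness expressed by (4.3.17) and (4.3.18) can be illustrated numerically. The sketch below is only an example under stated assumptions: it assumes the autocorrelation $r_x(l) = a^{|l|}$ with $a = 0.9$ (a choice made here for illustration, not taken from the text), for which the true spectrum has the closed form $(1-a^2)/(1 - 2a\cos\omega + a^2)$, and it evaluates the sum in (4.3.17) directly. The function names are hypothetical.

```python
import math

def true_spectrum(a, w):
    """Spectrum of a process with autocorrelation r(l) = a**|l|, |a| < 1."""
    return (1 - a * a) / (1 - 2 * a * math.cos(w) + a * a)

def expected_periodogram(a, w, N):
    """Evaluate (4.3.17): sum over |l| < N of (1 - |l|/N) r(l) exp(-j w l).
    The sum is real because r(l) is even, so only cosines are needed."""
    total = 1.0  # l = 0 term, r(0) = 1
    for l in range(1, N):
        total += 2 * (1 - l / N) * a ** l * math.cos(w * l)
    return total

a, w = 0.9, 0.0  # assumed example values
for N in (16, 64, 256, 1024):
    bias = expected_periodogram(a, w, N) - true_spectrum(a, w)
    print(f"N={N:5d}  bias at w=0: {bias:.4f}")
```

For this assumed autocorrelation the bias at $\omega = 0$ is negative and its magnitude decreases as $N$ increases, in line with (4.3.18).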

The bias is introduced by the sidelobes of the correlation window through leakage, as illustrated in Section 4.1.
Therefore, we can reduce the bias by using the modified periodogram and a “better” window. Bias can be avoided if
either $N = \infty$, in which case the spectrum of the window is a periodic train of impulses, or $R_x(e^{j\omega}) = \sigma_x^2$, that is,
$x(n)$ has a flat power spectrum. Thus, for white noise, $\hat{R}_x(e^{j\omega})$ is unbiased for all $N$. This fact was apparent in
Example 4.3.1 and is very important for practical applications. In the following example, we illustrate that the bias
becomes worse as the dynamic range of the spectrum increases.


EXAMPLE 4.3.2 (BIAS AND LEAKAGE PROPERTIES OF THE PERIODOGRAM). Consider an AR(2) process with

$$\mathbf{a}_2 = [1 \;\; -0.75 \;\; 0.5]^T \qquad d_0 = 1 \tag{4.3.23}$$

and an AR(4) process with

$$\mathbf{a}_4 = [1 \;\; -2.7607 \;\; 3.8106 \;\; -2.6535 \;\; 0.9238]^T \qquad d_0 = 1 \tag{4.3.24}$$

where $w(n) \sim \mathrm{WN}(0,1)$. Both processes have been used extensively in the literature for power spectrum estimation studies
(Percival and Walden 1993). Their power spectrum is given by (see Chapter 3)

$$R_x(e^{j\omega}) = \frac{\sigma_w^2\,d_0^2}{|A(e^{j\omega})|^2} = \frac{\sigma_w^2}{|A(e^{j\omega})|^2} \tag{4.3.25}$$

For simulation purposes, $N = 1024$ samples of each process were generated. The sample realizations and the shapes of the two
power spectra in (4.3.25) are shown in Figure 4.13. The dynamic range of the two spectra, that is, $\max R_x(e^{j\omega}) / \min R_x(e^{j\omega})$, is
about 15 and 65 dB, respectively.


From the sample realizations, periodograms and modified periodograms, based on the Hanning window, were computed by using
(4.3.4) at $N_{FFT} = 1024$ frequencies. These are shown in Figure 4.14. The periodograms for the AR(2) and AR(4) processes,
respectively, are shown in the top row while the modified periodograms for the same processes are shown in the bottom row. These
plots illustrate that the periodogram is a biased estimator of the power spectrum. In the case of the AR(2) process, since the
spectrum has a small dynamic range (15 dB), the bias in the periodogram estimate is not obvious; furthermore, the windowing in
the modified periodogram did not show much improvement. On the other hand, the AR(4) spectrum has a large dynamic range, and
hence the bias is clearly visible at high frequencies. This bias is clearly reduced by windowing of the data in the modified
periodogram. In both cases, the random fluctuations are not reduced by the data windowing operation.


†This spectrum is sometimes referred to as the Fejér kernel.

FIGURE 4.13
Sample realizations and power spectra of the AR(2) and AR(4) processes used in Example 4.3.2.

FIGURE 4.14
Illustration of properties of periodogram as a power spectrum estimator.
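
The dynamic ranges quoted in Example 4.3.2 can be examined directly from (4.3.25). The following sketch is illustrative: it takes the coefficient vectors of (4.3.23) and (4.3.24), assumes $\sigma_w^2 = d_0 = 1$, and evaluates the spectrum on a uniform grid of 1025 frequencies in $[0, \pi]$ (the grid size and the function names are choices made here, not part of the example).

```python
import cmath
import math

def ar_spectrum_db(a, num_freqs=1024, sigma2=1.0):
    """10*log10 of sigma2 / |A(e^{jw})|^2 on a uniform grid of w in [0, pi]."""
    out = []
    for i in range(num_freqs + 1):
        w = math.pi * i / num_freqs
        A = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        out.append(10 * math.log10(sigma2 / abs(A) ** 2))
    return out

def dynamic_range_db(a):
    """Difference between the largest and smallest spectrum values, in dB."""
    s = ar_spectrum_db(a)
    return max(s) - min(s)

ar2 = [1.0, -0.75, 0.5]
ar4 = [1.0, -2.7607, 3.8106, -2.6535, 0.9238]
print("AR(2) dynamic range (dB):", round(dynamic_range_db(ar2), 1))
print("AR(4) dynamic range (dB):", round(dynamic_range_db(ar4), 1))
```

The printed values can be compared with the figures of about 15 and 65 dB stated in the example; a finer grid may be needed to capture very sharp spectral peaks exactly.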


EXAMPLE 4.3.3 (FREQUENCY RESOLUTION PROPERTY OF THE PERIODOGRAM). Consider two unit-amplitude
sinusoids observed in unit-variance white noise. Let

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + v(n)$$

where $\phi_1$ and $\phi_2$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $v(n)$ is a unit-variance
white noise. Since the two frequencies, $0.35\pi$ and $0.4\pi$, are close, we will need (see Table 4.1)

$$N - 1 > \frac{1.81\pi}{0.4\pi - 0.35\pi} \qquad \text{or} \qquad N > 37$$

To obtain a periodogram ensemble, 50 realizations of $x(n)$ for $N = 32$ and $N = 64$ were generated, and their periodograms
were computed. The plots of these periodogram overlays and the corresponding ensemble average for $N = 32$ and $N = 64$ are
shown in Figure 4.15. For $N = 32$, frequencies in the periodogram cannot be resolved, as expected; but for $N = 64$ it is possible
to separate the two sinusoids with ease. Note that the modified periodogram (i.e., data windowing) will not help since windowing
increases smoothing and smearing of peaks.

FIGURE 4.15
Illustration of the frequency resolution property of the periodogram in Example 4.3.3.
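
The resolution behavior in Example 4.3.3 can also be examined without simulation. Because the phases are independent and uniformly distributed, the cross terms vanish in the ensemble average, and the expected periodogram is the noise variance plus $1/(4N)$ times shifted copies of the rectangular-window spectrum (4.3.19). The sketch below evaluates this expression; the "resolved" criterion (a dip at the midpoint between the two frequencies) and the function names are assumptions made here for illustration.

```python
import math

def dirichlet_sq(w, N):
    """|W_R(e^{jw})|^2 = [sin(wN/2)/sin(w/2)]^2 for the N-point rectangular window."""
    s = math.sin(w / 2)
    if abs(s) < 1e-12:          # limit value at w = 0 (mod 2*pi)
        return float(N * N)
    return (math.sin(w * N / 2) / s) ** 2

def expected_periodogram(w, N, freqs=(0.35 * math.pi, 0.4 * math.pi), noise_var=1.0):
    """Ensemble-average periodogram of unit-amplitude, random-phase sinusoids in white noise."""
    total = noise_var
    for w0 in freqs:
        total += (dirichlet_sq(w - w0, N) + dirichlet_sq(w + w0, N)) / (4 * N)
    return total

def resolved(N, w1=0.35 * math.pi, w2=0.4 * math.pi):
    """True if the average periodogram dips between the two sinusoid frequencies."""
    mid = expected_periodogram((w1 + w2) / 2, N)
    return mid < min(expected_periodogram(w1, N), expected_periodogram(w2, N))

for N in (32, 64):
    print(N, resolved(N))
```

Under this criterion the two lines merge into a single peak for $N = 32$ and separate for $N = 64$, consistent with the example.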


The case of nonzero mean. In the periodogram method of spectrum analysis in this section, we assumed that the 
random signal has zero mean. If a random signal has nonzero mean, it should be estimated using (2.4.20) and then 
removed from the signal prior to computing its periodogram. This is because the power spectrum of a nonzero mean 
signal has an impulse at the zero frequency. If this mean is relatively large, then because of the leakage inherent in the 
periodogram, this mean will obscure low-amplitude, low-frequency components of the spectrum. Even though the 
estimate is not an exact value, its removal often provides better estimates, especially at low frequencies. 


Covariance of $\hat{R}_x(e^{j\omega})$. Obtaining an expression for the covariance of the periodogram is a rather complicated
process. However, it has been shown (Jenkins and Watts 1968) that

$$\mathrm{cov}\{\hat{R}_x(e^{j\omega_1}), \hat{R}_x(e^{j\omega_2})\} \simeq R_x(e^{j\omega_1})\,R_x(e^{j\omega_2}) \left\{ \left[\frac{\sin[(\omega_1+\omega_2)N/2]}{N\sin[(\omega_1+\omega_2)/2]}\right]^2 + \left[\frac{\sin[(\omega_1-\omega_2)N/2]}{N\sin[(\omega_1-\omega_2)/2]}\right]^2 \right\} \tag{4.3.26}$$

This expression applies to stationary random signals with zero mean and Gaussian probability density. The
approximation becomes exact if the signal has a flat spectrum (white noise). Although this approximation deteriorates
for non-Gaussian probability densities, the qualitative results that one can draw from this approximation appear to
hold for a rather broad range of densities.


From (4.3.26), for $\omega_1 = (2\pi/N)k_1$ and $\omega_2 = (2\pi/N)k_2$ with $k_1, k_2$ integers, we have

$$\mathrm{cov}\{\hat{R}_x(e^{j\omega_1}), \hat{R}_x(e^{j\omega_2})\} \simeq 0 \qquad \text{for } k_1 \ne k_2 \tag{4.3.27}$$

Thus, values of the periodogram spaced in frequency by integer multiples of $2\pi/N$ are approximately uncorrelated.
As the record length $N$ increases, these uncorrelated periodogram samples come closer together, and hence the rate
of fluctuations in the periodogram increases. This explains the results in Figure 4.12.


Variance of $\hat{R}_x(e^{j\omega})$. The variance of the periodogram at a particular frequency $\omega = \omega_1 = \omega_2$ can be obtained
from (4.3.26):

$$\mathrm{var}\{\hat{R}_x(e^{j\omega})\} \simeq R_x^2(e^{j\omega})\left[1 + \left(\frac{\sin \omega N}{N \sin \omega}\right)^2\right] \tag{4.3.28}$$

For large values of $N$, the variance of $\hat{R}_x(e^{j\omega})$ can be approximated by

$$\mathrm{var}\{\hat{R}_x(e^{j\omega})\} \simeq \begin{cases} R_x^2(e^{j\omega}) & 0 < \omega < \pi \\ 2R_x^2(e^{j\omega}) & \omega = 0, \pi \end{cases} \tag{4.3.29}$$

This result is crucial, because it shows that the variance of the periodogram (estimate) remains at the level of
$R_x^2(e^{j\omega})$ (the square of the quantity to be estimated), independent of the record length $N$ used. Furthermore, since the variance does
not tend to zero as $N \to \infty$, the periodogram is not a consistent estimator; that is, its distribution does not tend to
cluster more closely around the true spectrum as $N$ increases.†

This behavior was illustrated in Example 4.3.1. The variance of $\hat{R}_x(e^{j\omega_k})$ fails to decrease as $N$ increases
because the number of periodogram values $\hat{R}_x(e^{j\omega_k})$, $k = 0, 1, \ldots, N-1$, is always equal to the length $N$ of
the data record.
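
The passage from (4.3.28) to (4.3.29) can be made concrete by evaluating the bracketed factor in (4.3.28) for a few record lengths. The sketch below is illustrative only; the function name and the numerical tolerance used to detect $\omega = 0$ and $\omega = \pi$ are choices made here.

```python
import math

def variance_factor(w, N):
    """Bracketed factor in (4.3.28): 1 + [sin(wN) / (N sin w)]^2."""
    s = math.sin(w)
    if abs(s) < 1e-12:          # w = 0 or pi: the ratio tends to +1 or -1
        return 2.0
    return 1.0 + (math.sin(w * N) / (N * s)) ** 2

for N in (16, 128, 1024):
    row = [round(variance_factor(w, N), 4) for w in (0.0, 0.1 * math.pi, 0.5 * math.pi, math.pi)]
    print(N, row)
```

Away from $\omega = 0$ and $\omega = \pi$ the factor approaches 1 as $N$ grows, while at the two end frequencies it equals 2 for every $N$, which is the content of (4.3.29).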


EXAMPLE 4.3.4 (COMPARISON OF PERIODOGRAM AND MODIFIED PERIODOGRAM). Consider the case of three sinusoids
discussed in Section 4.1.4. In particular, we assume that these sinusoids are observed in white
noise with

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n)$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $v(n)$ is a unit-variance
white noise. An ensemble of 50 realizations of $x(n)$ was generated using $N = 128$. The periodograms and the Hamming
window-based modified periodograms of these realizations were computed, and the results are shown in Figure 4.16. The top row
of the figure contains periodogram overlays and the corresponding ensemble average for the unwindowed periodogram, and the
bottom row shows the same for the modified periodogram. Spurious peaks (especially near the two close frequencies) in the
periodogram have been suppressed by the data windowing operation in the modified periodogram; hence the peak corresponding to
$0.8\pi$ is sufficiently enhanced. This enhancement is clearly at the expense of the frequency resolution (or smearing of the true peaks),
which is to be expected. The overall variance of the noise floor is still not reduced.

FIGURE 4.16
Comparison of periodogram and modified periodogram in Example 4.3.4 (periodogram overlays and averages for N = 128).


Failure of the periodogram 


To conclude, we note that the periodogram in its “basic form” is a very poor estimator of the power spectrum
function. The failure of the periodogram when applied to random signals is uniquely pointed out in Jenkins and Watts
(1968, p. 213):

†The definition of the PSD by $R_x(e^{j\omega}) = \lim_{N\to\infty} \hat{R}_x(e^{j\omega})$ is not valid because even if $\lim_{N\to\infty} E\{\hat{R}_x(e^{j\omega})\} = R_x(e^{j\omega})$, the variance of
$\hat{R}_x(e^{j\omega})$ does not tend to zero as $N \to \infty$ (Papoulis 1991).


The basic reason why Fourier analysis breaks down when applied to time series is that it is based on the assumption of fixed 
amplitudes, frequencies and phases. Time series, on the other hand, are characterized by random changes of frequencies, amplitudes 
and phases. Therefore it is not surprising that Fourier methods need to be adapted to account for the random nature of a time series. 


The attempt at improving the periodogram by windowing the available data, that is, by using the modified
periodogram in Example 4.3.4, showed that the presence and the length of the window had no effect on the variance.
The major problems with the periodogram lie in its variance, which is on the order of $R_x^2(e^{j\omega})$, as well as in its
erratic behavior. Thus, to obtain a better estimator, we should reduce its variance; that is, we should “smooth” the
periodogram. From the previous discussion, it follows that the sequence $\hat{R}_x(k)$, $k = 0, 1, \ldots, N-1$, of the harmonic
periodogram components can be reasonably assumed to be a sequence of uncorrelated random variables. Furthermore,
it is well known that the variance of the average of $K$ uncorrelated random variables with the same variance is $1/K$
times the variance of one of these individual random variables. This suggests two ways of reducing the variance,
which also lead to smoother spectral estimators:

• Average contiguous values of the periodogram.

• Average periodograms obtained from multiple data segments.

It should be apparent that owing to stationarity, the two approaches should provide comparable results under similar
circumstances.


4.3.2 Power Spectrum Estimation by Smoothing a Single 
Periodogram—The Blackman-Tukey Method 


The idea of reducing the variance of the periodogram through smoothing using a moving-average filter was first
proposed by Daniell (1946). The estimator proposed by Daniell is a zero-phase moving-average filter, given by

$$\hat{R}_x^{(PS)}(e^{j\omega_k}) \triangleq \frac{1}{2M+1}\sum_{j=-M}^{M} \hat{R}_x(e^{j\omega_{k-j}}) = \sum_{j=-M}^{M} W(e^{j\omega_j})\,\hat{R}_x(e^{j\omega_{k-j}}) \tag{4.3.30}$$

where $\omega_k = (2\pi/N)k$, $k = 0, 1, \ldots, N-1$, $W(e^{j\omega_j}) \triangleq 1/(2M+1)$, and the superscript (PS) denotes
periodogram smoothing. Since the samples of the periodogram are approximately uncorrelated,

$$\mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega_k})\} \simeq \frac{1}{2M+1}\,\mathrm{var}\{\hat{R}_x(e^{j\omega_k})\} \tag{4.3.31}$$

that is, averaging $2M+1$ consecutive spectral lines reduces the variance by a factor of $2M+1$. The quantity
$\Delta\omega \approx (2\pi/N)(2M+1)$ determines the frequency resolution, since any peaks within the $\Delta\omega$ range are smoothed
over the entire interval $\Delta\omega$ into a single peak and cannot be resolved. Thus, increasing $M$ reduces the variance
(resulting in a smoother spectrum estimate), at the expense of spectral resolution. This is the fundamental tradeoff in
practical spectral analysis.
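
The moving average in (4.3.30) is straightforward to sketch. The code below is a minimal illustration with hypothetical names; it wraps the index $k-j$ modulo $N$ (the periodogram samples are periodic in $k$), and the demonstration input is an assumption: independent unit-mean exponential values, used here only as a stand-in for the approximately uncorrelated periodogram samples of unit-variance white noise.

```python
import random

def daniell_smooth(P, M):
    """Average each periodogram sample with its M neighbours on either side, as in (4.3.30).
    Indices wrap around because the periodogram is periodic in k with period N."""
    N = len(P)
    width = 2 * M + 1
    return [sum(P[(k - j) % N] for j in range(-M, M + 1)) / width for k in range(N)]

def sample_variance(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

rng = random.Random(0)                                # arbitrary seed
P = [rng.expovariate(1.0) for _ in range(512)]        # stand-in periodogram values
for M in (0, 2, 8):
    print(M, round(sample_variance(daniell_smooth(P, M)), 4))
```

For such input the printed variance should fall roughly in proportion to $1/(2M+1)$, as (4.3.31) suggests, while neighbouring smoothed values become strongly correlated.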


Blackman-Tukey approach 


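The discrete moving average in (4.3.30) is computed in the frequency domain. We now introduce a better and
simpler way to smooth the periodogram by operating on the estimated autocorrelation sequence. To this end, we note
that the continuous frequency equivalent of the discrete convolution formula (4.3.30) is the periodic convolution

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \hat{R}_x(e^{j(\omega-\theta)})\,W_a(e^{j\theta})\,d\theta \triangleq \hat{R}_x(e^{j\omega}) \otimes W_a(e^{j\omega}) \tag{4.3.32}$$

where $W_a(e^{j\omega})$ is a periodic function of $\omega$ with period $2\pi$, given by

$$W_a(e^{j\omega}) = \begin{cases} \dfrac{1}{\Delta\omega} & |\omega| < \dfrac{\Delta\omega}{2} \\ 0 & \dfrac{\Delta\omega}{2} \le |\omega| \le \pi \end{cases} \tag{4.3.33}$$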
By using the convolution theorem, (4.3.32) can be written as

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \sum_{l=-(L-1)}^{L-1} \hat{r}_x(l)\,w_a(l)\,e^{-j\omega l} \tag{4.3.34}$$

where $w_a(l)$ is the inverse Fourier transform of $W_a(e^{j\omega})$ and $L < N$. As we have already mentioned, the
window $w_a(l)$ is known as the correlation or lag window.† The correlation window corresponding to (4.3.33) is

$$w_a(l) = \frac{\sin(l\,\Delta\omega/2)}{\pi l\,\Delta\omega} \qquad -\infty < l < \infty \tag{4.3.35}$$

Since $w_a(l)$ has infinite duration, its truncation at $|l| = L < N$ creates ripples in $W_a(e^{j\omega})$ (Gibbs effect). To
avoid this problem, we use correlation windows with finite duration, that is, $w_a(l) = 0$ for $|l| > L < N$. For real
sequences, where $\hat{r}_x(l)$ is real and even, $w_a(l)$ [and hence $W_a(e^{j\omega})$] should be real and even. Given that
$\hat{R}_x(e^{j\omega})$ is nonnegative, a sufficient (but not necessary) condition that $\hat{R}_x^{(PS)}(e^{j\omega})$ be nonnegative is that
$W_a(e^{j\omega}) \ge 0$ for all $\omega$. This condition holds for the Bartlett (triangular) and Parzen (see Problem 4.11) windows,
but it does not hold for the Hamming, Hanning, or Kaiser window.

Thus, we note that smoothing the periodogram $\hat{R}_x(e^{j\omega})$ by convolving it with the spectrum
$W_a(e^{j\omega}) = \mathcal{F}\{w_a(l)\}$ is equivalent to windowing the autocorrelation estimate $\hat{r}_x(l)$ with the correlation window
$w_a(l)$. This approach to power spectrum estimation, which was introduced by Blackman and Tukey (1959), involves
the following steps:

1. Estimate the autocorrelation sequence from the unwindowed data.
2. Window the obtained autocorrelation samples.
3. Compute the DTFT of the windowed autocorrelation as given in (4.3.34).


A pictorial comparison between the theoretical [i.e., using (4.3.32)] and the above practical computation of power
spectrum using the single-periodogram smoothing is shown in Figure 4.17.

FIGURE 4.17
Comparison of the theory and practice of the Blackman-Tukey method.


The resolution of the Blackman-Tukey power spectrum estimator is determined by the duration $2L-1$ of the
correlation window. For most correlation windows, the resolution is measured by the 3-dB bandwidth of the
mainlobe, which is on the order of $2\pi/L$ rad per sampling interval.

The statistical quality of the Blackman-Tukey estimate $\hat{R}_x^{(PS)}(e^{j\omega})$ can be evaluated by examining its mean,
covariance, and variance.

Mean of $\hat{R}_x^{(PS)}(e^{j\omega})$. The expected value of the smoothed periodogram $\hat{R}_x^{(PS)}(e^{j\omega})$ can be obtained by using
(4.3.34) and (4.2.11). Indeed, we have

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = \sum_{l=-(L-1)}^{L-1} E\{\hat{r}_x(l)\}\,w_a(l)\,e^{-j\omega l} = \sum_{l=-(L-1)}^{L-1} r_x(l)\left(1 - \frac{|l|}{N}\right) w_a(l)\,e^{-j\omega l} \tag{4.3.36}$$

or, using the frequency convolution theorem, we have

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} = R_x(e^{j\omega}) \otimes W_B(e^{j\omega}) \otimes W_a(e^{j\omega}) \tag{4.3.37}$$

where

$$W_B(e^{j\omega}) = \mathcal{F}\left\{\left(1 - \frac{|l|}{N}\right) w_R(l)\right\} = \frac{1}{N}\left[\frac{\sin(\omega N/2)}{\sin(\omega/2)}\right]^2 \tag{4.3.38}$$

is the Fourier transform of the Bartlett window. Since $E\{\hat{R}_x^{(PS)}(e^{j\omega})\} \ne R_x(e^{j\omega})$, $\hat{R}_x^{(PS)}(e^{j\omega})$ is a biased estimate of
$R_x(e^{j\omega})$.

†The term spectral window is quite often used for $W_a(e^{j\omega}) = \mathcal{F}\{w_a(l)\}$, the Fourier transform of the correlation window. However, this
term is misleading because $W_a(e^{j\omega})$ is essentially a frequency-domain impulse response. We use the term correlation window for
$w_a(l)$ and the term Fourier transform of the correlation window for $W_a(e^{j\omega})$.
R, (e°). 
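For $L \ll N$, $(1 - |l|/N) \simeq 1$ and hence we obtain

$$\begin{aligned} E\{\hat{R}_x^{(PS)}(e^{j\omega})\} &\simeq \sum_{l=-(L-1)}^{L-1} r_x(l)\,w_a(l)\,e^{-j\omega l} = R_x(e^{j\omega}) \otimes W_a(e^{j\omega}) \\ &= \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,W_a(e^{j(\omega-\theta)})\,d\theta \end{aligned} \tag{4.3.39}$$

If $L$ is sufficiently large, the spectrum $W_a(e^{j\omega})$ of the correlation window $w_a(l)$ consists of a narrow mainlobe. If $R_x(e^{j\omega})$ can be
assumed to be constant within the mainlobe, we have

$$E\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq R_x(e^{j\omega})\,\frac{1}{2\pi}\int_{-\pi}^{\pi} W_a(e^{j\theta})\,d\theta$$

which implies that $\hat{R}_x^{(PS)}(e^{j\omega})$ is asymptotically unbiased if

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} W_a(e^{j\omega})\,d\omega = w_a(0) = 1 \tag{4.3.40}$$

that is, if the spectrum of the correlation window has unit area. Under this condition, if both $L$ and $N$ tend to
infinity, then $W_a(e^{j\omega})$ and $W_B(e^{j\omega})$ become periodic impulse trains and the convolution (4.3.37) reproduces
$R_x(e^{j\omega})$.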

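Covariance of $\hat{R}_x^{(PS)}(e^{j\omega})$. The following approximation

$$\mathrm{cov}\{\hat{R}_x^{(PS)}(e^{j\omega_1}), \hat{R}_x^{(PS)}(e^{j\omega_2})\} \simeq \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x^2(e^{j\theta})\,W_a(e^{j(\omega_1-\theta)})\,W_a(e^{j(\omega_2-\theta)})\,d\theta \tag{4.3.41}$$

derived in Jenkins and Watts (1968), holds under the assumptions that (1) $N$ is sufficiently large that $W_B(e^{j\omega})$
behaves as a periodic impulse train and (2) $L$ is sufficiently large that $W_a(e^{j\omega})$ is sufficiently narrow that the
product $W_a(e^{j(\theta+\omega_1)})\,W_a(e^{j(\theta-\omega_2)})$ is negligible. Hence, the covariance increases with the width of
$W_a(e^{j\omega})$, that is, as the amount of overlap between the windows $W_a(e^{j(\omega_1-\theta)})$ (centered at $\omega_1$) and $W_a(e^{j(\omega_2-\theta)})$
(centered at $\omega_2$) increases.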

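Variance of $\hat{R}_x^{(PS)}(e^{j\omega})$. When $\omega = \omega_1 = \omega_2$, (4.3.41) gives

$$\mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq \frac{1}{2\pi N}\int_{-\pi}^{\pi} R_x^2(e^{j\theta})\,W_a^2(e^{j(\omega-\theta)})\,d\theta \tag{4.3.42}$$

If $R_x(e^{j\omega})$ is smooth within the width of $W_a(e^{j\omega})$, then

$$\mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq R_x^2(e^{j\omega})\,\frac{1}{2\pi N}\int_{-\pi}^{\pi} W_a^2(e^{j\theta})\,d\theta \tag{4.3.43}$$

or

$$\mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\} \simeq \frac{E_w}{N}\,R_x^2(e^{j\omega}) \qquad 0 < \omega < \pi \tag{4.3.44}$$

where

$$E_w \triangleq \frac{1}{2\pi}\int_{-\pi}^{\pi} W_a^2(e^{j\omega})\,d\omega = \sum_{l=-(L-1)}^{L-1} w_a^2(l) \tag{4.3.45}$$

is the energy of the correlation window. From (4.3.29) and (4.3.44) we have

$$\frac{\mathrm{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\}}{\mathrm{var}\{\hat{R}_x(e^{j\omega})\}} \simeq \frac{E_w}{N} \qquad 0 < \omega < \pi \tag{4.3.46}$$

which is known as the variance reduction factor or variance ratio and provides the reduction in variance attained by
smoothing the periodogram.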
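
The variance ratio (4.3.46) depends only on the window energy (4.3.45) and the record length. As a small illustration, the sketch below evaluates $E_w/N$ for a Bartlett (triangular) correlation window $w_a(l) = 1 - |l|/L$; the choice of window, the values of $N$ and $L$, and the function names are assumptions made here for the example.

```python
def bartlett_lag_window(L):
    """w_a(l) = 1 - |l|/L for |l| <= L-1 (triangular correlation window), keyed by lag."""
    return {l: 1.0 - abs(l) / L for l in range(-(L - 1), L)}

def window_energy(w):
    """E_w in (4.3.45): sum of squared correlation-window samples."""
    return sum(v * v for v in w.values())

def variance_ratio(w, N):
    """E_w / N as in (4.3.46)."""
    return window_energy(w) / N

N = 512
for L in (64, 128, 256):
    print(L, round(variance_ratio(bartlett_lag_window(L), N), 3))
```

For this window the energy has the closed form $1 + (L-1)(2L-1)/(3L)$, which is roughly $2L/3$, so the ratio grows in proportion to $L/N$: a shorter correlation window gives a larger variance reduction.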

In the beginning of this section, we explained the variance reduction in terms of frequency-domain averaging.
An alternative explanation can be provided by considering the windowing of the estimated autocorrelation. As
discussed in Section 4.2, the variance of the autocorrelation estimate increases as $|l|$ approaches $N$ because fewer
and fewer samples are used to compute the estimate. Since every value of $\hat{r}_x(l)$ affects the value of $\hat{R}_x(e^{j\omega})$ at all
frequencies, the less reliable values affect the quality of the periodogram everywhere. Thus, we can reduce the
variance of the periodogram by minimizing the contribution of autocorrelation terms with large variance, that is, with
lags close to $N$, by proper windowing.

As we have already stressed, there is a tradeoff between resolution and variance. For the variance to be small,
we must choose a window that contains a small amount of energy $E_w$. Since $|w_a(l)| \le 1$, we have $E_w \le 2L$. Thus,
to reduce the variance, we must have $L \ll N$. The bias of $\hat{R}_x^{(PS)}(e^{j\omega})$ is directly related to the resolution, which is
determined by the mainlobe width of the window, which in turn is proportional to $1/L$. Hence, to reduce the bias,
$W_a(e^{j\omega})$ should have a narrow mainlobe that demands a large $L$. The requirements for high resolution (small bias)
and low variance can be simultaneously satisfied only if $N$ is sufficiently large. The variance reduction for some
commonly used windows is examined in Problem 4.12. Empirical evidence suggests that use of the Parzen window is
a reasonable choice.

Confidence intervals. In the interpretation of spectral estimates, it is important to know whether the spectral
details are real or are due to statistical fluctuations. Such information is provided by the confidence intervals (Chapter
2). When the spectrum is plotted on a logarithmic scale, the $(1-\alpha) \times 100$ percent confidence interval is constant at
every frequency, and it is given by (Koopmans 1974)

$$\left( 10\log \hat{R}_x^{(PS)}(e^{j\omega}) - 10\log\frac{\chi_\nu^2(1-\alpha/2)}{\nu}, \;\; 10\log \hat{R}_x^{(PS)}(e^{j\omega}) + 10\log\frac{\nu}{\chi_\nu^2(\alpha/2)} \right) \tag{4.3.47}$$

where

$$\nu = \frac{2N}{\sum_{l=-(L-1)}^{L-1} w_a^2(l)} \tag{4.3.48}$$

is the number of degrees of freedom of a $\chi_\nu^2$ distribution.


Computation of $\hat{R}_x^{(PS)}(e^{j\omega})$ using the DFT. In practice, the Blackman-Tukey power spectrum estimator is
computed by using an $N$-point DFT as follows:
1. Estimate the autocorrelation $\hat{r}_x(l)$, using the formula

$$\hat{r}_x(l) = \hat{r}_x^*(-l) = \frac{1}{N}\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) \qquad l = 0, 1, \ldots, L-1 \tag{4.3.49}$$

For $L > 100$, indirect computation of $\hat{r}_x(l)$ by using DFT techniques is usually more efficient (see Problem
4.13).
2. Form the sequence

$$f(l) = \begin{cases} \hat{r}_x(l)\,w_a(l) & 0 \le l \le L-1 \\ 0 & L \le l \le N-L \\ \hat{r}_x^*(N-l)\,w_a(N-l) & N-L+1 \le l \le N-1 \end{cases} \tag{4.3.50}$$

3. Compute the power spectrum estimate

$$\hat{R}_x^{(PS)}(e^{j\omega})\big|_{\omega=(2\pi/N)k} = F(k) = \mathrm{DFT}\{f(l)\} \qquad 0 \le k \le N-1 \tag{4.3.51}$$

as the $N$-point DFT of the sequence $f(l)$.
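
The three steps above can be sketched in a few lines. The following Python code is an illustration under stated assumptions, not the book toolbox routine: the function names are hypothetical, a Bartlett correlation window $w_a(l) = 1 - l/L$ is used by default, the DFT is evaluated by a direct sum for clarity, and the record length $N$ is also used as the DFT length, so $2L - 1 \le N$ is required for the layout in (4.3.50).

```python
import cmath
import random

def autocorr(x, L):
    """Biased estimate (4.3.49) for lags 0..L-1 (x may be real or complex)."""
    N = len(x)
    return [sum(x[n + l] * complex(x[n]).conjugate() for n in range(N - l)) / N
            for l in range(L)]

def blackman_tukey(x, L, lag_window=None):
    """Blackman-Tukey estimate at w_k = 2*pi*k/N via (4.3.50) and (4.3.51)."""
    N = len(x)
    if 2 * L - 1 > N:
        raise ValueError("need 2L - 1 <= N")
    if lag_window is None:                       # Bartlett lag window (an assumption)
        lag_window = [1.0 - l / L for l in range(L)]
    r = autocorr(x, L)
    f = [0j] * N
    for l in range(L):
        f[l] = r[l] * lag_window[l]
    for l in range(1, L):
        f[N - l] = r[l].conjugate() * lag_window[l]   # negative lags wrap to the end
    # Direct N-point DFT; the result is real up to rounding because f is conjugate symmetric
    return [sum(f[l] * cmath.exp(-2j * cmath.pi * k * l / N) for l in range(N)).real
            for k in range(N)]

random.seed(0)  # arbitrary seed
x = [random.gauss(0.0, 1.0) for _ in range(64)]
est = blackman_tukey(x, L=16)
print(min(est), max(est))
```

With the Bartlett window the estimate should be nonnegative (up to rounding), in agreement with the sufficient condition $W_a(e^{j\omega}) \ge 0$ discussed earlier in this section.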

MATLAB does not provide a direct function to implement the Blackman-Tukey method. However, such a
function can be easily constructed by using built-in MATLAB functions and the above approach. The book toolbox
function

Rx = bt_psd(x, Nfft, window, L);
implements the above algorithm in which window is any available MATLAB window and Nfft is chosen to be
larger than N to obtain a high-density spectrum.


EXAMPLE 4.3.5 (BLACKMAN-TUKEY METHOD). Consider the spectrum estimation of three sinusoids in white noise
given in Example 4.3.4, that is,

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n) \tag{4.3.52}$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $v(n)$ is a
unit-variance white noise. An ensemble of 50 realizations of $x(n)$ was generated using $N = 512$. The autocorrelations of these
realizations were estimated up to lag $L = 64$, 128, and 256. These autocorrelations were windowed using the Bartlett window,
and then their 1024-point DFT was computed as the spectrum estimate. The results are shown in Figure 4.18. The top row of the
figure contains estimate overlays and the corresponding ensemble average for $L = 64$, the middle row for $L = 128$, and the
bottom row for $L = 256$. Several observations can be made from these plots. First, the variance in the estimate has considerably
reduced over the periodogram estimate. Second, the lower the lag distance $L$, the lower the variance and the resolution (i.e., the
higher the smoothing of the peaks). This observation is consistent with our discussion above about the effect of $L$ on the quality
of estimates. Finally, all the frequencies including the one at $0.8\pi$ are clearly distinguishable, something that the basic
periodogram could not achieve.

FIGURE 4.18
Spectrum estimation of three sinusoids in white noise using the Blackman-Tukey method in Example 4.3.5.


4.3.3 Power Spectrum Estimation by Averaging Multiple
Periodograms—The Welch-Bartlett Method 


As mentioned in Section 4.3.1, in general, the variance of the average of $K$ IID random variables is $1/K$ times the
variance of each of the random variables. Thus, to reduce the variance of the periodogram, we could average the
periodograms from $K$ different realizations of a stationary random signal. However, in most practical applications,
only a single realization is available. In this case, we can subdivide the existing record $\{x(n), 0 \le n \le N-1\}$ into $K$
(possibly overlapping) smaller segments as follows:

$$x_i(n) = x(iD + n)\,w(n) \qquad 0 \le n \le L-1, \;\; 0 \le i \le K-1 \tag{4.3.53}$$

where $w(n)$ is a window of duration $L$ and $D$ is an offset distance. If $D < L$, the segments overlap; and for
$D = L$, the segments are contiguous. The periodogram of the $i$th segment is

$$\hat{R}_{x,i}(e^{j\omega}) \triangleq \frac{1}{L}\,|X_i(e^{j\omega})|^2 = \frac{1}{L}\left|\sum_{n=0}^{L-1} x_i(n)\,e^{-j\omega n}\right|^2 \tag{4.3.54}$$

We remind the reader that the window $w(n)$ in (4.3.53) is called a data window because it is applied directly to the
data, in contrast to a correlation window that is applied to the autocorrelation sequence [see (4.3.34)]. Notice that
there is no need for the data window to have an even shape or for its Fourier transform to be nonnegative. The
purpose of using the data window is to control spectral leakage.

The spectrum estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ is obtained by averaging $K$ periodograms as follows:

$$\hat{R}_x^{(PA)}(e^{j\omega}) \triangleq \frac{1}{K}\sum_{i=0}^{K-1} \hat{R}_{x,i}(e^{j\omega}) = \frac{1}{KL}\sum_{i=0}^{K-1} |X_i(e^{j\omega})|^2 \tag{4.3.55}$$

where the superscript (PA) denotes periodogram averaging. To determine the bias and variance of $\hat{R}_x^{(PA)}(e^{j\omega})$,
we let $D = L$ so that the segments do not overlap. The so-computed estimate $\hat{R}_x^{(B)}(e^{j\omega})$ is known as the Bartlett
estimate. We also assume that $r_x(l)$ is very small for $|l| > L$. This implies that the signal segments can be assumed
to be approximately uncorrelated. To show that the simple periodogram averaging in Bartlett’s method reduces the
periodogram variance, we consider the following example.

EXAMPLE 4.3.6 (PERIODOGRAM AVERAGING). Let $x(n)$ be a stationary white Gaussian noise with zero mean and unit
variance. The theoretical spectrum of $x(n)$ is

$$R_x(e^{j\omega}) = \sigma_x^2 = 1 \qquad -\pi < \omega \le \pi$$

An ensemble of 50 different 512-point records of $x(n)$ was generated using a pseudorandom number generator. The Bartlett
estimate of each record was computed for $K = 1$ (i.e., the basic periodogram), $K = 4$ (or $L = 128$), and $K = 8$ (or $L = 64$).
The results in the form of estimate overlays and averages are shown in Figure 4.19. The effect of periodogram averaging is clearly
evident.
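
The segmenting and averaging in (4.3.53) through (4.3.55) can be sketched as follows. This is a minimal illustration rather than the simulation behind Figure 4.19: the names are hypothetical, a small recursive radix-2 FFT is included so that the code needs only the standard library (segment lengths must therefore be powers of 2), the default is the rectangular window with $D = L$ (Bartlett's method), and the experiment uses 256-point records and 20 realizations to keep the run short.

```python
import cmath
import random

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def averaged_periodogram(x, L, D=None, window=None):
    """Average of K segment periodograms, (4.3.53)-(4.3.55), at w_k = 2*pi*k/L.
    A supplied window is assumed to satisfy sum(w^2) = L."""
    D = L if D is None else D                    # D = L: nonoverlapping segments
    window = [1.0] * L if window is None else window
    K = (len(x) - L) // D + 1                    # number of complete segments
    acc = [0.0] * L
    for i in range(K):
        seg = [x[i * D + n] * window[n] for n in range(L)]
        for k, Xk in enumerate(fft(seg)):
            acc[k] += abs(Xk) ** 2
    return [a / (K * L) for a in acc]

def pooled_variance(K, N=256, records=20, seed=0):
    """Variance of Bartlett estimates of white noise, pooled over frequency and records."""
    rng = random.Random(seed)
    vals = []
    for _ in range(records):
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]
        vals.extend(averaged_periodogram(x, N // K))
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

for K in (1, 4, 8):
    print(K, round(pooled_variance(K), 3))
```

The pooled variance should drop roughly as $1/K$, at the price of only $L = N/K$ distinct frequency samples per estimate.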


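Mean of $\hat{R}_x^{(PA)}(e^{j\omega})$. The mean value of $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$E\{\hat{R}_x^{(PA)}(e^{j\omega})\} = \frac{1}{K}\sum_{i=0}^{K-1} E\{\hat{R}_{x,i}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\} \tag{4.3.56}$$

where we have assumed that $E\{\hat{R}_{x,i}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\}$ because of the stationarity assumption. From (4.3.56) and
(4.3.15), we have

$$E\{\hat{R}_x^{(PA)}(e^{j\omega})\} = E\{\hat{R}_x(e^{j\omega})\} = \frac{1}{2\pi L}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,R_w(e^{j(\omega-\theta)})\,d\theta \tag{4.3.57}$$

where $R_w(e^{j\omega})$ is the spectrum of the data window $w(n)$. Hence, $\hat{R}_x^{(PA)}(e^{j\omega})$ is a biased estimate of $R_x(e^{j\omega})$.
However, if the data window is normalized such that

$$\sum_{n=0}^{L-1} w^2(n) = L \tag{4.3.58}$$

the estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ becomes asymptotically unbiased [see the discussion following equation (4.3.15)].


Variance of $\hat{R}_x^{(PA)}(e^{j\omega})$. The variance of $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$\mathrm{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\} = \frac{1}{K}\,\mathrm{var}\{\hat{R}_x(e^{j\omega})\} \tag{4.3.59}$$

or using (4.3.29) gives

$$\mathrm{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\} \simeq \frac{1}{K}\,R_x^2(e^{j\omega}) \tag{4.3.60}$$

Clearly, as $K$ increases, the variance tends to zero. Thus, $\hat{R}_x^{(PA)}(e^{j\omega})$ provides an asymptotically unbiased and
consistent estimate of $R_x(e^{j\omega})$. If $N$ is fixed and $N = KL$, we see that increasing $K$ to reduce the variance (or
equivalently obtain a smoother estimate) results in a decrease in $L$, that is, a reduction in resolution (or equivalently
an increase in bias).

FIGURE 4.19
Spectral estimation of white noise using Bartlett’s method in Example 4.3.6.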

When $w(n)$ in (4.3.53) is the rectangular window of duration $L$, the squared magnitude of its Fourier transform is equal to
the Fourier transform of the triangular sequence $w_T(l) \triangleq L - |l|$, $|l| < L$, which when combined with the $1/L$
factor in (4.3.57), results in the Bartlett window

$$w_B(l) = \begin{cases} 1 - \dfrac{|l|}{L} & |l| < L \\ 0 & \text{elsewhere} \end{cases} \tag{4.3.61}$$

with

$$W_B(e^{j\omega}) = \frac{1}{L}\left[\frac{\sin(\omega L/2)}{\sin(\omega/2)}\right]^2 \tag{4.3.62}$$

This special case of averaging multiple nonoverlapping periodograms was introduced by Bartlett (1953).
The method has been extended to modified overlapping periodograms by Welch (1970), who has shown that the 
shape of the window does not affect the variance formula (4.3.59). Welch showed that overlapping the segments by 50 percent reduces the variance by about a factor of 2, owing to doubling the number of segments. More overlap does not result in additional reduction of variance because the data segments become less and less independent. Clearly, the nonoverlapping segments can be uncorrelated only for white noise signals. However, the data segments can be considered approximately uncorrelated if they do not have sharp spectral peaks or if their autocorrelations decay fast. Thus, the variance reduction factor for the spectral estimator $\hat{R}_x^{(PA)}(e^{j\omega})$ is

$$\frac{\mathrm{var}\{\hat{R}_x^{(PA)}(e^{j\omega})\}}{\mathrm{var}\{\hat{R}_x(e^{j\omega})\}} \simeq \frac{1}{K} \qquad 0 < \omega < \pi \qquad (4.3.63)$$

and is reduced by a factor of 2 for 50 percent overlap.


Confidence intervals. The $(1-\alpha) \times 100$ percent confidence interval on a logarithmic scale may be shown to be (Jenkins and Watts 1968)

$$\left( 10\log \hat{R}_x^{(PA)}(e^{j\omega}) - 10\log\frac{\chi^2_{2K}(1-\alpha/2)}{2K},\ \ 10\log \hat{R}_x^{(PA)}(e^{j\omega}) + 10\log\frac{2K}{\chi^2_{2K}(\alpha/2)} \right) \qquad (4.3.64)$$

where $\chi^2_{2K}$ is a chi-squared distribution with $2K$ degrees of freedom.
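
To make (4.3.64) concrete, the sketch below evaluates the interval in Python using only the standard library. It assumes that $\chi^2_{2K}(p)$ denotes the $p$th quantile of the chi-squared distribution with $2K$ degrees of freedom, and it uses the closed-form distribution function that exists for an even number of degrees of freedom together with a simple bisection search. The function names are hypothetical.

```python
import math

def chi2_cdf_even(x, K):
    """CDF of a chi-squared variable with 2K degrees of freedom (closed form)."""
    if x <= 0.0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, K):          # sum_{j=0}^{K-1} (x/2)^j / j!
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, K):
    """p-th quantile of chi-squared with 2K degrees of freedom, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, K) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, K) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def confidence_interval_db(R_hat, K, alpha=0.05):
    """Lower and upper limits of (4.3.64) in decibels for an estimate R_hat > 0."""
    center = 10.0 * math.log10(R_hat)
    lower = center - 10.0 * math.log10(chi2_quantile_even(1 - alpha / 2, K) / (2 * K))
    upper = center + 10.0 * math.log10(2 * K / chi2_quantile_even(alpha / 2, K))
    return lower, upper

lower_db, upper_db = confidence_interval_db(R_hat=1.0, K=8)
```

For K = 8 averaged segments the interval is noticeably asymmetric about the estimate on the decibel scale, and it narrows as K grows.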

Computation of $\hat{R}_x^{(PA)}(e^{j\omega})$ using the DFT. In practice, to compute $\hat{R}_x^{(PA)}(e^{j\omega})$ at $L$ equally spaced frequencies $\omega_k = 2\pi k/L$, $0 \le k \le L-1$, the method of periodogram averaging can be easily and efficiently implemented by using the DFT as follows (we have assumed that $L$ is even):

1. Segment data $\{x(n)\}_0^{N-1}$ into $K$ segments of length $L$, each offset by $D$ duration using
$$\bar{x}_i(n) = x(iD + n) \qquad 0 \le i \le K-1,\ 0 \le n \le L-1 \qquad (4.3.65)$$
If $D = L$, there is no overlap; and if $D = L/2$, the overlap is 50 percent.
2. Window each segment, using data window $w(n)$
$$x_i(n) = \bar{x}_i(n)\,w(n) = x(iD + n)\,w(n) \qquad 0 \le i \le K-1,\ 0 \le n \le L-1 \qquad (4.3.66)$$
3. Compute the $L$-point DFTs $\tilde{X}_i(k)$ of the segments $x_i(n)$, $0 \le i \le K-1$,
$$\tilde{X}_i(k) = \sum_{n=0}^{L-1} x_i(n)\,e^{-j(2\pi/L)kn} \qquad 0 \le k \le L-1,\ 0 \le i \le K-1 \qquad (4.3.67)$$
4. Accumulate the squares $|\tilde{X}_i(k)|^2$
$$\tilde{S}(k) = \sum_{i=0}^{K-1} |\tilde{X}_i(k)|^2 \qquad 0 \le k \le L/2 \qquad (4.3.68)$$
5. Finally, normalize by $KL$ to obtain the estimate $\hat{R}_x^{(PA)}(k)$:
$$\hat{R}_x^{(PA)}(k) = \frac{1}{KL}\,\tilde{S}(k) \qquad 0 \le k \le L/2 \qquad (4.3.69)$$


At this point we emphasize that the spectrum estimate $\hat{R}_x^{(PA)}(k)$ is always nonnegative. A pictorial description of this computational algorithm is shown in Figure 4.20. A more efficient way to compute $\hat{R}_x^{(PA)}(k)$ is examined in Problem 4.14.
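
The five steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than the book's code: it uses a direct $O(L^2)$ DFT from the standard library instead of an FFT, returns all $L$ bins rather than only $0 \le k \le L/2$, does not enforce the window normalization (4.3.58), and the test tone and the names `welch_bartlett` and `dft` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(L^2) DFT; adequate for short illustrative segments."""
    L = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / L) for n in range(L))
            for k in range(L)]

def welch_bartlett(x, L, D, window=None):
    """Average the modified periodograms of length-L segments offset by D samples."""
    w = window if window is not None else [1.0] * L    # rectangular by default
    K = (len(x) - L) // D + 1                           # number of full segments
    S = [0.0] * L
    for i in range(K):
        seg = [x[i * D + n] * w[n] for n in range(L)]   # steps 1 and 2
        X = dft(seg)                                    # step 3
        for k in range(L):
            S[k] += abs(X[k]) ** 2                      # step 4
    return [s / (K * L) for s in S], K                  # step 5

N, L = 64, 16
x = [math.cos(0.25 * math.pi * n) for n in range(N)]    # hypothetical test tone
R_bartlett, K_b = welch_bartlett(x, L, D=L)             # no overlap
R_welch, K_w = welch_bartlett(x, L, D=L // 2)           # 50 percent overlap
peak_bin = max(range(L // 2 + 1), key=lambda k: R_bartlett[k])
```

With this tone the frequency $0.25\pi$ falls exactly on bin $k = 2$ of the 16-point grid, so the estimate peaks there, and halving the offset $D$ raises the number of segments from 4 to 7.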


In MATLAB the Welch-Bartlett method is implemented by using the function

Rx = psd(x, Nfft, Fs, window(L), Noverlap, 'none');

where window is the name of any MATLAB-provided window function (e.g., hamming); Nfft is the size of the DFT, which is chosen to be larger than L to obtain a high-density spectrum; Fs is the sampling frequency, which is used for plotting purposes; and Noverlap specifies the number of overlapping samples. If the boxcar window is used along with Noverlap=0, then we obtain Bartlett’s method of periodogram averaging. (Note that Noverlap is different from the offset parameter D given above.) If Noverlap=L/2 is used, then we obtain Welch’s averaged periodogram method with 50 percent overlap.
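
Since Noverlap counts the samples shared by consecutive segments while D is the offset between segment starts, the two are related by D = L - Noverlap. The short Python sketch below, with a hypothetical helper name, uses this relation to count the full segments available in a record of N = 512 samples.

```python
def segment_count(N, L, noverlap):
    """Number of full length-L segments in N samples when consecutive
    segments share `noverlap` samples (offset D = L - noverlap)."""
    D = L - noverlap
    if D <= 0:
        raise ValueError("noverlap must be smaller than L")
    return (N - L) // D + 1

# 50 percent overlap (Welch) and no overlap (Bartlett) for a 512-point record
welch_counts = {L: segment_count(512, L, L // 2) for L in (256, 128, 64)}
bartlett_counts = {L: segment_count(512, L, 0) for L in (128, 64)}
# welch_counts    -> {256: 3, 128: 7, 64: 15}
# bartlett_counts -> {128: 4, 64: 8}
```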

A biased estimate $\hat{r}_x(l)$, $|l| < L$, of the autocorrelation sequence of $x(n)$ can be obtained by taking the inverse $N$-point DFT of $\hat{R}_x^{(PA)}(k)$ if $N \ge 2L-1$. Since only samples of the continuous spectrum $\hat{R}_x^{(PA)}(e^{j\omega})$ are available, the obtained autocorrelation sequence $\hat{r}_x^{(PA)}(l)$ is an aliased version of the true autocorrelation $r_x(l)$ of the signal $x(n)$ (see Problem 4.14).

FIGURE 4.20
Pictorial description of the Welch-Bartlett method.


EXAMPLE 4.3.7 (BARTLETT’S METHOD). Consider again the spectrum estimation of three sinusoids in white noise given in Example 4.3.4, that is,

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n) \qquad (4.3.70)$$

where $\phi_1$, $\phi_2$, and $\phi_3$ are jointly independent random variables uniformly distributed over $[-\pi, \pi]$ and $v(n)$ is a unit-variance white noise. An ensemble of 50 realizations of $x(n)$ was generated using $N = 512$. The Bartlett estimate of each realization was computed for $K = 1$ (i.e., the basic periodogram), $K = 4$ (or $L = 128$), and $K = 8$ (or $L = 64$). The results in the form of estimate overlays and averages are shown in Figure 4.21. Observe that the variance in the estimate has consistently reduced over the periodogram estimate as the number of averaging segments has increased. However, this reduction has come at the price of broadening of the spectral peaks. Since no window is used, the sidelobes are very prominent even for the $K = 8$ segment case. Thus confidence in the $\omega = 0.8\pi$ spectral line is not very high for the $K = 8$ case.

FIGURE 4.21
Estimation of three sinusoids in white noise using Bartlett’s method in Example 4.3.7.

EXAMPLE 4.3.8 (WELCH’S METHOD). Consider Welch’s method for the random process in the above example for 
N =512, 50 percent overlap, and a Hamming window. Three different values for L were considered; L = 256 (3 segments), 
L=128 (7 segments), and L=64 (15 segments). The estimate overlays and averages are shown in Figure 4.22. In comparing 
these results with those in Figure 4.21, note that the windowing has considerably reduced the spurious peaks in the spectra but has 
also further smoothed the peaks. Thus the peak at $0.8\pi$ is recognizable with high confidence, but the separation of two close
peaks is not so clear for L =64. However, the L =128 case provides the best balance between separation and detection. On 
comparing the Blackman-Tukey (Figure 4.18) and Welch estimates, we observe that the results are comparable in terms of variance 
reduction and smoothing aspects. 
FIGURE 4.22
Estimation of three sinusoids in white noise using Welch’s method in Example 4.3.8. 


4.3.4 Some Practical Considerations and Examples 


The periodogram and its modified version, which are the basic tools involved in the estimation of the power spectrum of stationary signals, can be computed either directly from the signal samples $\{x(n)\}_0^{N-1}$ using the DTFT formula

$$\hat{R}_x(e^{j\omega}) = \frac{1}{N}\left|\sum_{n=0}^{N-1} w(n)\,x(n)\,e^{-j\omega n}\right|^2 \qquad (4.3.71)$$

or indirectly using the autocorrelation sequence

$$\hat{R}_x(e^{j\omega}) = \sum_{l=-(N-1)}^{N-1} \hat{r}_x(l)\,e^{-j\omega l} \qquad (4.3.72)$$

where $\hat{r}_x(l)$ is the estimated autocorrelation of the windowed segment $\{w(n)x(n)\}_0^{N-1}$. The periodogram $\hat{R}_x(e^{j\omega})$ provides an unacceptable estimate of the power spectrum because
1. it has a bias that depends on the length $N$ and the shape of the data window $w(n)$ and
2. its variance is approximately equal to the square of the true spectrum $R_x(e^{j\omega})$, regardless of $N$.

Given a data segment of fixed duration $N$, there is no way to reduce the bias, or equivalently to increase the resolution, because it depends on the length and the shape of the window. However, we can reduce the variance either by smoothing the single periodogram of the data (method of Blackman-Tukey) or by averaging multiple periodograms obtained by partitioning the available record into smaller overlapping segments (method of Bartlett-Welch).


The method of Blackman-Tukey is based on the following modification of the indirect periodogram formula

$$\hat{R}_x^{(PS)}(e^{j\omega}) = \sum_{l=-(L-1)}^{L-1} \hat{r}_x(l)\,w_a(l)\,e^{-j\omega l} \qquad (4.3.73)$$

which basically involves windowing of the estimated autocorrelation sequence with a proper correlation window. Using only the first $L \ll N$ more-reliable values of the autocorrelation sequence reduces the variance of the spectrum estimate by a factor of approximately $L/N$. However, at the same time, this reduces the resolution from about $1/N$ to about $1/L$. The recommended range for $L$ is between $0.1N$ and $0.2N$.

The method of Bartlett-Welch is based on partitioning the available data record into windowed overlapping segments of length $L$, computing their periodograms by using the direct formula (4.3.71), and then averaging the resulting periodograms to compute the estimate

$$\hat{R}_x^{(PA)}(e^{j\omega}) = \frac{1}{KL}\sum_{i=0}^{K-1}\left|\sum_{n=0}^{L-1} x_i(n)\,e^{-j\omega n}\right|^2 \qquad (4.3.74)$$

whose resolution is reduced to approximately $1/L$ and whose variance is reduced by a factor of about $1/K$, where $K$ is the number of segments.

The reduction in resolution and variance of the Blackman-Tukey estimate is achieved by “averaging” the values 
of the spectrum at consecutive frequency bins by windowing the estimated autocorrelation sequence. In the 
Bartlett-Welch method, the same effect is achieved by averaging the values of multiple shorter periodograms at the 
same frequency bin. The PSD estimation methods and their properties are summarized in Table 4.3. The multitaper 
spectrum estimation method given in the last column of Table 4.3 is discussed in Section 4.4.
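
The equivalence of the direct and indirect periodogram formulas, and the effect of a lag window, can be illustrated with a small Python sketch (standard library only). It assumes real-valued data, a rectangular data window, the biased autocorrelation estimate, and a Bartlett lag window of length L = 8; the data sequence and the function names are hypothetical.

```python
import cmath
import math

def biased_autocorr(x):
    """Biased estimate r(l) = (1/N) sum_n x(n+l) x(n) for real x, 0 <= l < N."""
    N = len(x)
    return [sum(x[n + l] * x[n] for n in range(N - l)) / N for l in range(N)]

def direct_periodogram(x, omega):
    """(1/N) |sum_n x(n) e^{-j omega n}|^2 with a rectangular data window."""
    N = len(x)
    return abs(sum(x[n] * cmath.exp(-1j * omega * n) for n in range(N))) ** 2 / N

def correlogram(r, omega, lag_window=None):
    """Sum over |l| < L of r(|l|) w_a(l) e^{-j omega l}, using r(-l) = r(l)."""
    L = len(r) if lag_window is None else len(lag_window)
    w = [1.0] * L if lag_window is None else lag_window
    return r[0] * w[0] + 2.0 * sum(r[l] * w[l] * math.cos(omega * l)
                                   for l in range(1, L))

x = [math.sin(0.9 * n) + 0.5 * math.cos(2.3 * n) for n in range(32)]  # hypothetical data
r = biased_autocorr(x)

# Direct and indirect formulas agree when all N lags are used without weighting.
omega = 1.1
gap = abs(direct_periodogram(x, omega) - correlogram(r, omega))

# Blackman-Tukey estimate with a Bartlett lag window of length L = 8.
L = 8
bartlett_lag = [1.0 - l / L for l in range(L)]       # w_a(l) for l = 0, ..., L-1
R_bt = [correlogram(r, 2 * math.pi * k / 64, bartlett_lag) for k in range(33)]
```

The Bartlett lag window is chosen here because its Fourier transform is nonnegative, so the smoothed estimate in `R_bt` stays nonnegative; a lag window without this property would not guarantee that.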


Table 4.3. Comparison of PSD estimation methods.

| | Periodogram $\hat{R}_x(e^{j\omega})$ | Single-periodogram smoothing (Blackman-Tukey) $\hat{R}_x^{(PS)}(e^{j\omega})$ | Multiple-periodogram averaging (Bartlett-Welch) $\hat{R}_x^{(PA)}(e^{j\omega})$ | Multitaper (Thomson) $\hat{R}_x^{(MT)}(e^{j\omega})$ |
|---|---|---|---|---|
| Description of the method | Compute DFT of data record | Compute DFT of windowed autocorrelation estimate (see Figure 4.17) | Split record into $K$ segments and average their modified periodograms (see Figure 4.20) | Window data record using $K$ orthonormal tapers and average their periodograms (see Figure 4.26) |
| Basic idea | Natural estimator of $R_x(e^{j\omega})$; the error $\lvert r_x(l) - \hat{r}_x(l) \rvert$ is large for large $\lvert l \rvert$ | Local smoothing of $\hat{R}_x(e^{j\omega})$ by weighting $\hat{r}_x(l)$ with a lag window $w_a(l)$ | Overlap data records to create more segments; window segments to reduce bias; average periodograms to reduce variance | For properly designed orthogonal tapers, periodograms are independent at each frequency. Hence averaging reduces variance |
| Bias | Severe for small $N$; negligible for large $N$ | Asymptotically unbiased | Asymptotically unbiased | Negligible for properly designed tapers |
| Resolution | $\propto 1/N$ | $\propto 1/L$, $L$ is maximum lag | $\propto 1/L$ | $\propto 1/N$ |

FIGURE 4.23
Illustration of the properties of the power spectrum estimators using autocorrelation windowing (left column) and periodogram averaging (right column) in Example 4.3.9.


EXAMPLE 4.3.9 (COMPARISON OF BLACKMAN-TUKEY AND WELCH-BARTLETT METHODS). Figure 4.23 
illustrates the properties of the power spectrum estimators based on autocorrelation windowing and periodogram averaging using 
the AR(4) model (4.3.24). The top plots show the power spectrum of the process. The left column plots show the power spectrum 
obtained by windowing the data with a Hanning window and the autocorrelation with a Parzen window of length L = 64, 128, 
and 256. We notice that as the length of the window increases, the resolution bandwidth (about 1/L) decreases, so finer spectral detail is resolved, while the variance increases. We see a similar
behavior with the method of averaged periodograms as the segment length L increases from 64 to 256. Clearly, both methods give 
comparable results if their parameters are chosen properly. 


Example of ocean wave data. To apply spectrum estimation techniques discussed in this chapter to real data, we 
will use two real-valued time series that are obtained by recording the height of ocean waves as a function of time, as 
measured by two wave gages of different designs. These two series are shown in Figure 4.24. The top graph shows 
the wire wave gage data while the bottom graph shows the infrared wave gage data. The frequency responses of these 
gages are such that, mainly because of their inertia, frequencies higher than 1 Hz cannot be reliably measured. The frequency range between 0.2 and 1 Hz is also important because the rate at which the spectrum decreases has a physical model associated with it. Both series were collected at a rate of 30 samples per second. There are 4096 samples in each series.† We will also use these data to study joint signal analysis in the next section.

FIGURE 4.24
Display of ocean wave data.


EXAMPLE 4.3.10 (ANALYSIS OF THE OCEAN WAVE DATA). Figure 4.25 depicts the periodogram averaging and smoothing estimates of the wire wave gage data. The top row of plots shows the Welch estimate using a Hamming window, L = 256, and 50 percent overlap between segments. The bottom row shows the Blackman-Tukey estimate using a Bartlett window and a lag length of L = 256. In both cases, a zoomed view of the plots between 0 and 1 Hz is shown in the right column to obtain a better view of the spectra. Both spectral estimates provide a similar spectral behavior, especially over the frequency range of 0 to 1 Hz. Furthermore, both show a broad, low-frequency peak at 0.13 Hz, corresponding to a period of about 8 s. The dominant features of the time series thus can be attributed to this peak and other features in the 0- to 0.2-Hz range. The shape of the spectrum between 0.2 and 1 Hz is a decaying exponential and is consistent with the physical model. Similar results were obtained for the infrared wave gage data.

FIGURE 4.25
Spectrum estimation of the ocean wave data using the Welch and Blackman-Tukey methods.


† These data were collected by A. Jessup, Applied Physics Laboratory, University of Washington. They were obtained from StatLib, a statistical archive maintained by Carnegie Mellon University.
4.4 Multitaper Power Spectrum Estimation


Tapering is another name for the data windowing operation in the time domain. The periodogram estimate of the power spectrum, discussed in Section 4.3, is an operation on a data record $\{x(n)\}_0^{N-1}$. One interpretation of this
finite-duration data record is that it is obtained by truncating an infinite-duration process x(n) with a rectangular 
window (or taper). Since bias and variance properties of the periodogram estimate are unacceptable, methods for bias 
and variance reduction were developed either by smoothing estimates in the frequency domain (using lag windows) 
or by averaging periodograms computed over several short segments (data windows). Since these window functions 
(other than the rectangular one) typically taper the response toward both ends of the data record, windows are also 
referred to as tapers. 

In 1982, Thomson suggested an alternate approach for producing a “direct” (or “raw” periodogram-based) 
spectral estimator. In this method, rather than use a single rectangular data taper as in the periodogram estimate, 
several data tapers are used on the same data record to compute several modified periodograms. These modified 
periodograms are then averaged (with or without weighting) to produce the multitaper spectral estimate. The central 
premise of this multitaper approach is that if the data tapers are properly designed orthogonal functions, then, under 
mild conditions, the spectral estimates would be independent of each other at every frequency. Thus, averaging would 
reduce the variance while proper design of full-length windows would reduce bias and loss of resolution. Thomson 
suggested windows based on discrete prolate spheroidal sequences (DPSSs) that form an orthonormal set, although 
any other orthogonal set with desirable properties can also be used. This DPSS set is also known as the set of Slepian
tapers. The multitaper method is different in spirit from the other methods in that it does not seek to produce highly 
smoothed spectra. Detailed discussions of the multitaper approach are given in Thomson (1982) and in Percival and 
Walden (1993). In this section, we provide a brief sketch of the algorithm. 


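Estimation of Auto Power Spectrum

Given a data record $\{x(n)\}_0^{N-1}$ of length $N$, consider a set of $K$ data tapers $\{w_k(n);\ 0 \le n \le N-1,\ 0 \le k \le K-1\}$. These tapers are assumed to be orthonormal, that is,

$$\sum_{n=0}^{N-1} w_k(n)\,w_l(n) = \begin{cases} 1 & k = l \\ 0 & k \ne l \end{cases} \qquad (4.4.1)$$

Let $\hat{R}_{k,x}(e^{j\omega})$ be the periodogram estimator based on the $k$th taper. Then, similar to (4.3.2), we obtain

$$\hat{R}_{k,x}(e^{j\omega}) = \left|\sum_{n=0}^{N-1} w_k(n)\,x(n)\,e^{-j\omega n}\right|^2 \qquad (4.4.2)$$

The simple averaged multitaper (MT) estimator is then defined by

$$\hat{R}_x^{(MT)}(e^{j\omega}) = \frac{1}{K}\sum_{k=0}^{K-1} \hat{R}_{k,x}(e^{j\omega}) \qquad (4.4.3)$$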
K i 


A pictorial description of this multitaper algorithm is shown in Figure 4.26. Another approach, suggested by 
Thomson, is to apply adaptive weights (both frequency- and data-dependent) prior to averaging to protect against the 
biasing degradations of different tapers. 


In either case, the multitaper estimator is an average of direct spectral estimators (called eigenspectra by Thomson) employing an orthonormal set of tapers. Thomson (1982) showed that under mild conditions, the orthonormality of the tapers results in an approximate independence of each individual $\hat{R}_{k,x}(e^{j\omega})$ at every frequency $\omega$. This approximate independence further implies that the equivalent degrees of freedom for $\hat{R}_x^{(MT)}(e^{j\omega})$ are equal to twice the number of data tapers. This increase in degrees of freedom is enough to shrink the width of the 95 percent confidence interval for $\hat{R}_x^{(MT)}(e^{j\omega})$ and to reduce the variability to the point at which the overall shape of the spectrum is easily recognizable even though the spectrum is not highly smoothed.

Clearly, the success of this approach lies in the selection of $K$ orthonormal tapers. To understand the rationale behind the selection of these tapers, consider the bias or mean of $\hat{R}_{k,x}(e^{j\omega})$. Following (4.3.15), we obtain

$$E\{\hat{R}_{k,x}(e^{j\omega})\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,R_{k,w}(e^{j(\omega-\theta)})\,d\theta \qquad (4.4.4)$$

where

$$R_{k,w}(e^{j\omega}) = \mathcal{F}\{w_k(n) * w_k(-n)\} = |W_k(e^{j\omega})|^2 \qquad (4.4.5)$$

It follows, then, from (4.4.3) that

$$E\{\hat{R}_x^{(MT)}(e^{j\omega})\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\theta})\,\bar{R}_w(e^{j(\omega-\theta)})\,d\theta \qquad (4.4.6)$$

where

$$\bar{R}_w(e^{j\omega}) \triangleq \frac{1}{K}\sum_{k=0}^{K-1} |W_k(e^{j\omega})|^2 \qquad (4.4.7)$$


The function $\bar{R}_w(e^{j\omega})$ is the spectral window of the averaged multitaper estimator, which is obtained by averaging spectra of the individual tapers. Hence, for $\bar{R}_w(e^{j\omega})$ to produce a good leakage-free estimate $\hat{R}_x^{(MT)}(e^{j\omega})$, all $K$ spectral windows must provide good protection against leakage. Therefore, each taper must have low sidelobe levels. Furthermore, the averaging of $K$ individual periodograms also reduces the overall variance of $\hat{R}_x^{(MT)}(e^{j\omega})$. The reduction in variance is possible if the $\hat{R}_{k,x}(e^{j\omega})$ are pairwise uncorrelated with common variance, in which case the variance reduces by a factor of $1/K$.

FIGURE 4.26
A pictorial description of the multitaper approach to power spectrum estimation.

Thus, we need K orthonormal data tapers such that each one provides a good protection against leakage and 
such that the resulting individual spectral estimates are nearly uncorrelated. One such set is obtained by using DPSS 
with parameter W and of orders k =0,---,K —1, where K is chosen to be less than or equal to the number 2W (called 
the Shannon number, which is also a fixed-resolution bandwidth). The design of these sequences is discussed in detail 
in Thomson (1982) and in Percival and Walden (1993). In MATLAB these tapers are generated by using the [w] = dpss(L, W) function, where L is the length of each of the 2W tapers, which are returned in the matrix w.

The first four 21-point DPSS tapers with W =4 and their Fourier transforms are shown in Figure 4.27 while 
the next four DPSS tapers are shown in Figure 4.28. It can be seen that higher-order tapers assume both positive and 
negative values. The zeroth-order taper (like other windows) heavily attenuates data values near n=0 and n= L. 
The higher-order tapers successively give greater weights to these values to the point that tapers for k> K have 
very poor bias properties and hence are not used. This behavior is quite evident in the frequency domain where as the 
taper order increases, mainlobe width and sidelobe attenuation decrease. The multitapering approach can be interpreted 
as a technique in which higher-order tapers capture information that is “lost” when only the first taper is used. 


In MATLAB the function
[Pxx, Pxxc, F] = PMTM(x, W, Nfft, Fs)
estimates the power spectrum of the data vector x in the array Pxx, using the multitaper approach. The function uses DPSS tapers with parameter W and adaptive weighted averaging as the default method. The 95 percent confidence interval is available in Pxxc. The size of the DFT used is Nfft, the sampling frequency is Fs, and the frequency values are returned in the vector F.

Another much simpler set of orthonormal tapers was suggested by Reidel and Siderenko (1995). This particular 
set contains harmonically related sinusoidal tapers. One important aspect of multitapering is to reduce the 
periodogram variance without reducing resolution caused by smoothing across frequencies. If the spectrum is 
changing slowly across the band so that sidelobe bias is not severe (recall the argument given for the unbiasedness of 
the periodogram for the white noise process), then sine tapers can reduce the variance. The kth taper in this set of 
$k = 0, 1, \ldots, N-1$ tapers is given by

$$w_k(n) = \sqrt{\frac{2}{N+1}}\,\sin\frac{\pi(k+1)(n+1)}{N+1} \qquad n = 0, 1, \ldots, N-1 \qquad (4.4.8)$$

where the amplitude term on the right is a normalization factor that ensures orthonormality of the tapers. These sine
tapers have much narrower mainlobe but also much higher sidelobes (recall the rectangular window) than the DPSS 
tapers. Thus they achieve a smaller bias due to smoothing by the mainlobe than the DPSS tapers, but at the expense 
of sidelobe suppression. Clearly this performance is acceptable if the spectrum is varying slowly. Owing to their 
simple nature, these tapers can be analyzed analytically, and it can be shown that (Reidel and Siderenko 1995) the 
$k$th sinusoidal taper has its spectral energy concentrated in the frequency bands

$$\frac{\pi k}{N+1} \le |\omega| \le \frac{\pi(k+2)}{N+1} \qquad k = 0, 1, \ldots, N-1 \qquad (4.4.9)$$

If the first $K < N$ tapers are used, then the multitaper estimator has the spectral window concentrated in the band

$$-\frac{\pi(K+1)}{N+1} \le \omega \le \frac{\pi(K+1)}{N+1} \qquad (4.4.10)$$

FIGURE 4.27
DPSS data tapers for k = 0, 1, 2, 3 in the time and frequency domains.

FIGURE 4.28
DPSS data tapers for k = 4, 5, 6, 7 in the time and frequency domains.


A summary of the multitaper algorithm performance and its comparison with other PSD estimation methods are 
given in Table 4.3. 
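
As a rough illustration of the simple averaged multitaper estimator (4.4.3), the Python sketch below uses sine tapers, because they have a closed form and need no eigenvalue computation; it is not the DPSS-based, adaptively weighted procedure used by PMTM. The taper formula assumes the standard normalized sine taper, and the test tone, the frequency grid, and all function names are hypothetical.

```python
import cmath
import math

def sine_taper(k, N):
    """k-th sine taper (k = 0, 1, ...): sqrt(2/(N+1)) sin(pi (k+1)(n+1)/(N+1))."""
    a = math.sqrt(2.0 / (N + 1))
    return [a * math.sin(math.pi * (k + 1) * (n + 1) / (N + 1)) for n in range(N)]

def eigenspectrum(x, w, omega):
    """Tapered periodogram |sum_n w(n) x(n) e^{-j omega n}|^2 for a unit-energy taper."""
    return abs(sum(w[n] * x[n] * cmath.exp(-1j * omega * n)
                   for n in range(len(x)))) ** 2

def multitaper(x, K, omegas):
    """Unweighted average of K eigenspectra on the given frequency grid."""
    tapers = [sine_taper(k, len(x)) for k in range(K)]
    return [sum(eigenspectrum(x, w, om) for w in tapers) / K for om in omegas]

N, K = 32, 3
tapers = [sine_taper(k, N) for k in range(K)]
# Gram matrix of the tapers; it should be close to the identity (orthonormality).
gram = [[sum(a * b for a, b in zip(tapers[i], tapers[j])) for j in range(K)]
        for i in range(K)]

x = [math.cos(0.4 * math.pi * n) for n in range(N)]     # hypothetical test tone
omegas = [math.pi * m / 64 for m in range(65)]          # grid over [0, pi]
R_mt = multitaper(x, K, omegas)
peak = omegas[max(range(len(R_mt)), key=lambda m: R_mt[m])]
```

The estimate is nonnegative by construction, and its peak lies near $0.4\pi$, smeared over roughly the bandwidth of the averaged spectral window. Increasing K widens that window and lowers the variance, which is the trade-off described above.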


EXAMPLE 4.5.1 (THREE SINUSOIDS IN WHITE NOISE). Consider the random process x(n) containing three sinusoids in white noise discussed earlier, that is,

$$x(n) = \cos(0.35\pi n + \phi_1) + \cos(0.4\pi n + \phi_2) + 0.25\cos(0.8\pi n + \phi_3) + v(n)$$

Fifty realizations of $x(n)$, $0 \le n \le N-1$, were processed using the PMTM function to obtain multitaper spectrum estimates for K = 3, 5, and 7 Slepian tapers. The results are shown in Figure 4.29 in the form of overlays and averages. Several interesting observations and comparisons with the previous methods can be made. The number of tapers used in the estimation determines the variance and the smearing of the spectrum. When fewer tapers are used, the peaks are sharper and narrower but the noise variance is larger. After increasing the number of tapers, the variance is decreased but the peaks become wider. When these estimates are compared with those from Welch’s method, an interesting feature can be noticed. The broadening of the peaks is not just at the base but is present along the entire length of the peak. Therefore, even with seven tapers, peaks are distinguishable. This feature is due to the bandwidth of the averaged spectral window of the K tapers.

FIGURE 4.29
Spectrum estimation of three sinusoids in white noise using the multitaper method in Example 4.5.1. (Panels include the multitaper estimate overlay and average for K = 3.)


EXAMPLE 4.5.2 (OCEAN WAVE DATA). Consider the wire gage wave data of Figure 4.24. The multitaper estimate $\hat{R}_x^{(MT)}(e^{j\omega})$ of these 4096-point data is obtained using the PMTM function in which the parameter W is set to 4. The plots are shown in Figure 4.30. The upper graph shows the spectrum over 0 to 2 Hz while the lower graph shows a zoomed plot over 0 to 0.5 Hz for clarity. In each graph, the middle solid curve is the spectral estimate in decibels while the upper and lower solid curves are the upper and lower limits of the 95 percent confidence interval. For comparison purposes, the “raw” periodogram estimate is also shown as small dots. Clearly, the periodogram has a large variability that is reduced in the multitaper estimate. At the same time, the multitaper estimate is not smooth, but its variability is small enough to follow the shape of the overall structure.

FIGURE 4.30
Spectrum estimation of the wire gage wave data using the multitaper method in Example 4.5.2.


4.5 Summary 


In this chapter, we presented many different nonparametric methods for estimating the power spectrum of a 
wide-sense stationary random process. Nonparametric methods do not depend on any particular model of the process 
but use estimators that are determined entirely by the data. Therefore, one has to be very careful about the data and 
the interpretation of results based on them. 

We began by revisiting the topic of frequency analysis of deterministic signals. Since the spectrum estimation of 
random processes is based on the Fourier transformation of data, the purpose of this discussion was to identify and 
study errors associated with the practical implementation. In this regard, three problems—the sampling of the 
continuous signal, windowing of the sampled data, and the sampling of the spectrum—were isolated and discussed in 
detail. Some useful data windows and their characteristics were also given. This background was necessary to 
understand more complex spectrum estimation methods and their results. 

An important topic of autocorrelation estimation was considered next. Although this discussion was not directly 
related to spectrum estimation, its inclusion was appropriate since one important method (i.e., that of Blackman and 
Tukey) was based on this estimation. The statistical properties of the estimator and its implementation completed this 
topic. 

The major part of this chapter was devoted to the section on the auto power spectrum estimation. The classical 
approach was to develop an estimator from the Fourier transform of the given values of the process. This was called 
the periodogram method, and it resulted in a natural PSD estimator as a Fourier transform of an autocorrelation 
estimate. Unfortunately, the statistical analysis of the periodogram showed that it was not an unbiased estimator or a 
consistent estimator; that is, its variability did not decrease with increasing data record length. The modification of the 
periodogram using the data window lessened the spectral leakage and improved the unbiasedness but did not decrease 
the variance. Several examples were given to verify these aspects. 

To improve the statistical performance of the simple periodogram, we then looked at several possible 
improvements to the basic technique. Two main directions emerged for reducing the variance: periodogram 
smoothing and periodogram averaging. These approaches produced consistent and asymptotically unbiased estimates. 
The periodogram smoothing was obtained by applying the lag window to the autocorrelation estimate and then 
Fourier-transforming it. This method was due to Blackman and Tukey, and results of its mean and variance were 
given. The periodogram averaging was done by segmenting the data to obtain several records, followed by 
windowing to reduce spectral leakage, and finally by averaging their periodograms to reduce variance. This was the 
well-known Welch-Bartlett method, and the results of its statistical analysis were also given. Implementations based on the DFT and MATLAB were given for both methods along with several examples to illustrate the performance of their estimates. Finally, we presented the multitaper method, which is based on applying several data windows or tapers to the data followed by
averaging of the resulting modified periodograms. The basic principle behind this method was that if the tapers are 
orthonormal and properly designed (to reduce leakage), then the resulting periodograms can be considered to be 
independent at each frequency and hence their average would reduce the variance. 


Problems 


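4.1 Let $x_c(t)$, $-\infty < t < \infty$, be a continuous-time signal with Fourier transform $X_c(F)$, $-\infty < F < \infty$, and let $x(n)$ be obtained by sampling $x_c(t)$ every $T$ per sampling interval with its DTFT $X(e^{j\omega})$.
(a) Show that the DTFT $X(e^{j\omega})$ is given by

$$X(e^{j\omega}) = F_s \sum_{l=-\infty}^{\infty} X_c(fF_s - lF_s) \qquad \omega = 2\pi f,\quad F_s = \frac{1}{T}$$

(b) Let $\tilde{X}_p(k)$ be obtained by sampling $X(e^{j\omega})$ every $2\pi/N$ rad per sampling interval, that is,

$$\tilde{X}_p(k) = X(e^{j2\pi k/N}) = F_s \sum_{l=-\infty}^{\infty} X_c\!\left(\frac{kF_s}{N} - lF_s\right)$$

Then show that inverse DFT($\tilde{X}_p$) is given by

$$x_p(n) \triangleq \mathrm{IDFT}(\tilde{X}_p) = \sum_{m=-\infty}^{\infty} x_c(nT - mNT)$$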


4.2 MATLAB provides two functions to generate triangular windows, namely, bartlett and triang. These two functions actually 


4.3 


4.4 


4.5 


4.6 


4.7 


4.8 


4.9 


generate two slightly different coefficients. 

(a) Use bartlett to generate N=11, 31, and 51 length windows wp (n), and plot their samples, using the stem function. 

(b) Compute the DTFTs W,(e!”), and plot their magnitudes over [—1, 2]. Determine experimentally the width of the mainlobe 
as a function of N. Repeat part (a) using the triang function. How are the lengths and the mainlobe widths different in this 
case? Which window function is an appropriate one in terms of nonzero samples? 

(c) Determine the length of the bart lett window that has the same mainlobe width as that of a 51-point rectangular window. 

Sidelobes of the window transform contribute to the spectral leakage due to the frequency-domain convolution. One measure of this 

leakage is the maximum sidelobe height, which generally occurs at the first sidelobe for all windows except the Dolph-Chebyshev 

window. 

(a) For simple windows such as the rectangular, Hanning, or Hamming window, the maximum sidelobe height is independent oê 
window length N. Choose N=11, 31, and 51, and determine the maximum sidelobe height in decibels for the above windows. 

(b) For the Kaiser window, the maximum sidelobe height is controlled by the shape parameter and is proportional to 
2 / sinh £ . Using several values of and N , verify the relationship between #8 and the maximum sidelobe height. 

(c) Determine the value of f that gives the maximum sidelobe height nearly the same as that of the Hamming window of the 
same length. Compare the mainlobe widths and the window coefficients of these two windows. 

(d) For the Dolph-Chebyshev window, all sidelobes have the same height A in decibels. For A=40, 50 and 60 dB, determine the 
3-dB mainlobe widths for N=31 length window. 

Let x(n) be given by 

y(n) = cos æn + cos (wn +ø) and x(n) = y(n)w(n) 


where w(n) isa length-N data window. The | X(e!”)|? is computed using MATLAB and is plotted over [0,7]. 

(a) Let w(n) be a rectangular window. For @, = 0.251 and @ =0.3z2, determine the minimum length N so that the two 
frequencies in the | X(e!”)| plot are barely separable for any arbitrary ø €[—z,7]. (You may want to consider the worst 
possible value of ø or experiment, using several values of ø.) 

(b) Repeat part (a) for a Hamming window. 

(c) Repeat part (a) for a Blackman window. 

In this problem we will prove that the autocorrelation matrix R, given in (4.2.3), in which the sample correlations are defined by 

(4.2.1), is a nonnegative definite matrix, that is, 

x"Rx>0  forevery x>0 

(a) Show that R, can be decomposed into the product KX", where X is called a data matrix. Determine the form of X . 

(b) Using the above decomposition, now prove that x” Rx >0, for every x>0. 

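4.6 An alternative autocorrelation estimate $\check{r}_x(l)$ is given in (4.2.13) and is repeated below.

$$\check{r}_x(l) = \begin{cases} \dfrac{1}{N-l} \displaystyle\sum_{n=0}^{N-l-1} x(n+l)\,x^*(n) & 0 \le l \le L < N \\ \check{r}_x^*(-l) & -N < -L \le l < 0 \\ 0 & \text{elsewhere} \end{cases}$$

(a) Show that the mean of $\check{r}_x(l)$ is equal to $r_x(l)$, and determine an approximate expression for the variance of $\check{r}_x(l)$.
(b) Show that the mean of the corresponding periodogram [that is, $\check{R}_x(e^{j\omega}) = \mathcal{F}[\check{r}_x(l)]$] is given by

$$E\{\check{R}_x(e^{j\omega})\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} R_x(e^{j\theta})\, W_R(e^{j(\omega - \theta)})\, d\theta$$

where $W_R(e^{j\omega})$ is the DTFT of the rectangular window and is sometimes called the Dirichlet kernel.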

4.7 Consider the above unbiased autocorrelation estimator $\check{r}_x(l)$ of a zero-mean white Gaussian process with variance $\sigma_x^2$.
(a) Determine the variance of $\check{r}_x(l)$. Compute its limiting value as $l \to \infty$.

(b) Repeat part (a) for the biased estimator $\hat{r}_x(l)$. Comment on any differences in the results.

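4.8 Show that the autocorrelation matrix $\check{\mathbf{R}}_x$ formed by using $\check{r}_x(l)$ is not nonnegative definite, that is,

$$\mathbf{x}^H \check{\mathbf{R}}_x \mathbf{x} < 0 \quad \text{for some } \mathbf{x} \ne \mathbf{0}$$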


4.9 In this problem, we will show that the periodogram $\hat{R}_x(e^{j\omega})$ can also be expressed as a DTFT of the autocorrelation estimate
$\hat{r}_x(l)$ given in (4.2.1).
(a) Let $v(n) = x(n)\,w_R(n)$, where $w_R(n)$ is a rectangular window of length N. Show that

$$\hat{r}_x(l) = \frac{1}{N}\, v(l) * v^*(-l)$$ (P.1)

(b) Take the DTFT of (P.1) to show that

$$\hat{R}_x(e^{j\omega}) = \sum_{l=-N+1}^{N-1} \hat{r}_x(l)\, e^{-j\omega l}$$
4.10 Consider the following simple windows over $0 \le n \le N-1$: rectangular, Bartlett, Hanning, and Hamming.
(a) Determine analytically the DTFT of each of the above windows. 
(b) Sketch the magnitude of these Fourier transforms for N =31. 
(c) Verify your sketches by performing a numerical computation of the DTFT using MATLAB. 


4.11 The Parzen window is given by

$$w_P(l) = \begin{cases} 1 - 6\left(\dfrac{l}{L}\right)^2 + 6\left(\dfrac{|l|}{L}\right)^3 & 0 \le |l| \le \dfrac{L}{2} \\ 2\left(1 - \dfrac{|l|}{L}\right)^3 & \dfrac{L}{2} < |l| < L \\ 0 & \text{elsewhere} \end{cases}$$ (P.2)

(a) Show that its DTFT is given by

$$W_P(e^{j\omega}) \simeq \left[\frac{\sin(\omega L/4)}{\sin(\omega/4)}\right]^4 \ge 0$$ (P.3)

Hence using the Parzen window as a correlation window always produces nonnegative spectrum estimates.

(b) Using MATLAB, compute and plot the time-domain window $w_P(l)$ and its frequency-domain response $W_P(e^{j\omega})$ for L=5,
10, and 20.

(c) From the frequency-domain plots in part (b), experimentally determine the 3-dB mainlobe width $\Delta\omega$ as a function of L.


4.12 The variance reduction ratio of a correlation window $w_a(l)$ is defined as

$$\frac{\operatorname{var}\{\hat{R}_x^{(PS)}(e^{j\omega})\}}{\operatorname{var}\{\hat{R}_x(e^{j\omega})\}} \simeq \frac{E_w}{N} \qquad 0 < \omega < \pi$$

where

$$E_w = \frac{1}{2\pi} \int_{-\pi}^{\pi} W_a^2(e^{j\omega})\, d\omega = \sum_{l=-(L-1)}^{L-1} w_a^2(l)$$

(a) Using MATLAB, compute and plot $E_w$ as a function of L for the following windows: rectangular, Bartlett, Hanning, Hamming,
and Parzen.
(b) Using your computations above, show that for L >> 1, the variance reduction ratio for each window is given by the formula in
the following table.

Window name     Variance reduction factor
Rectangular     2L/N
Bartlett        0.667L/N
Hanning         0.75L/N
Hamming         0.7948L/N
Parzen          0.539L/N

4.13 For L>100, the direct computation of $\hat{r}_x(l)$ using (4.3.39) is time-consuming; hence an indirect computation using the DFT
can be more efficient. This computation is implemented by the following steps:

* Given the sequence $\{x(n)\}_{n=0}^{N-1}$, pad enough zeros to make it a (2N - 1)-point sequence.
* Compute the $N_{FFT}$-point FFT of x(n) to obtain $\tilde{X}(k)$, where $N_{FFT}$ is equal to the next power-of-2 number that is greater
than or equal to 2N - 1.
* Compute $(1/N)\,|\tilde{X}(k)|^2$ to obtain $\tilde{R}_x(k)$.
* Compute the $N_{FFT}$-point IFFT of $\tilde{R}_x(k)$ to obtain $\hat{r}_x(l)$.


Develop a MATLAB function rx = autocfft(x,L) which computes $\hat{r}_x(l)$ over $-L \le l \le L$. Compare this function with the
autoc function discussed in the chapter in terms of the execution time for $L \ge 100$.
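
The FFT-based steps of Problem 4.13 can be sketched as follows. This is a minimal Python/NumPy illustration, not the MATLAB function the problem asks for. The name `autocfft` is borrowed from the problem statement, `autoc_direct` is a hypothetical stand-in for the chapter's autoc function, and the biased 1/N normalization is an assumption.

```python
import numpy as np

def autocfft(x, L):
    """Biased autocorrelation estimate for lags -L..L, computed with the FFT."""
    x = np.asarray(x)
    N = len(x)
    if not 0 <= L < N:
        raise ValueError("L must satisfy 0 <= L < N")
    # Smallest power of two >= 2N - 1, so circular correlation equals linear correlation.
    nfft = 1 << (2 * N - 2).bit_length()
    X = np.fft.fft(x, nfft)          # zero-padded FFT
    R = np.abs(X) ** 2 / N           # (1/N)|X(k)|^2
    r = np.fft.ifft(R)               # lag l >= 0 at index l, lag -l at index nfft - l
    r = np.concatenate((r[nfft - L:], r[:L + 1]))
    return r if np.iscomplexobj(x) else r.real

def autoc_direct(x, L):
    """Direct-sum reference: r(l) = (1/N) sum_n x(n+l) conj(x(n))."""
    x = np.asarray(x)
    N = len(x)
    pos = np.array([np.vdot(x[:N - l], x[l:]) / N for l in range(L + 1)])
    return np.concatenate((np.conj(pos[:0:-1]), pos))
```

Timing the two functions for increasing L (for example with `time.perf_counter`) gives the comparison the problem asks for; the direct sum costs on the order of N·L operations, the FFT route on the order of N log N.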
4.14 The Welch-Bartlett estimate $\hat{R}_x^{(PA)}(k)$ is given by

$$\hat{R}_x^{(PA)}(k) = \frac{1}{KL} \sum_{i=0}^{K-1} |X_i(k)|^2$$

If x(n) is real-valued, then the sum in the above expression can be evaluated more efficiently. Let K be an even number. Then we
will combine two real-valued sequences into one complex-valued sequence and compute one FFT, which will reduce the overall
computations. Specifically, let

$$g_r(n) \triangleq x_{2r}(n) + j\,x_{2r+1}(n) \qquad n = 0, 1, \ldots, L-1, \quad r = 0, 1, \ldots, \frac{K}{2} - 1$$

Then the L-point DFT of $g_r(n)$ is given by

$$G_r(k) = X_{2r}(k) + j\,X_{2r+1}(k) \qquad k = 0, 1, \ldots, L-1, \quad r = 0, 1, \ldots, \frac{K}{2} - 1$$

(a) Show that

$$|G_r(k)|^2 + |G_r(L-k)|^2 = 2\left[\,|X_{2r}(k)|^2 + |X_{2r+1}(k)|^2\,\right] \qquad k = 0, 1, \ldots, L-1, \quad r = 0, 1, \ldots, \frac{K}{2} - 1$$

(b) Determine the resulting expression for $\hat{R}_x^{(PA)}(k)$ in terms of $G_r(k)$.
(c) What changes are necessary if K is an odd number? Provide detailed steps for this case.
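
The identity in part (a) of Problem 4.14 can be checked numerically. The sketch below assumes unwindowed, real-valued segments stored as the rows of a K-by-L array with K even; both function names are hypothetical.

```python
import numpy as np

def periodogram_avg_direct(segments):
    """(1/(K*L)) * sum_i |X_i(k)|^2 using one FFT per segment."""
    seg = np.asarray(segments, dtype=float)
    K, L = seg.shape
    return (np.abs(np.fft.fft(seg, axis=1)) ** 2).sum(axis=0) / (K * L)

def periodogram_avg_packed(segments):
    """Same average for real segments, using one complex FFT per pair of segments."""
    seg = np.asarray(segments, dtype=float)
    K, L = seg.shape
    if K % 2:
        raise ValueError("this sketch assumes an even number of segments K")
    g = seg[0::2] + 1j * seg[1::2]          # g_r(n) = x_{2r}(n) + j x_{2r+1}(n)
    P = np.abs(np.fft.fft(g, axis=1)) ** 2  # |G_r(k)|^2, shape (K/2, L)
    P_rev = P[:, (-np.arange(L)) % L]       # |G_r(L-k)|^2, index taken modulo L
    # |G_r(k)|^2 + |G_r(L-k)|^2 = 2(|X_2r(k)|^2 + |X_2r+1(k)|^2)
    return (P + P_rev).sum(axis=0) / (2 * K * L)
```

The packed version computes K/2 complex FFTs instead of K, which is the saving the problem refers to.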
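4.15 Since $\hat{R}_x^{(PA)}(e^{j\omega})$ is a PSD estimate, one can determine the autocorrelation estimate $\hat{r}_x^{(PA)}(l)$ from Welch’s method as

$$\hat{r}_x^{(PA)}(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{R}_x^{(PA)}(e^{j\omega})\, e^{j\omega l}\, d\omega$$ (P.4)

Let $\tilde{R}_x^{(PA)}(k)$ be the samples of $\hat{R}_x^{(PA)}(e^{j\omega})$ according to

$$\tilde{R}_x^{(PA)}(k) \triangleq \hat{R}_x^{(PA)}(e^{j2\pi k/N_{FFT}}) \qquad 0 \le k \le N_{FFT} - 1$$

(a) Show that the IDFT $\tilde{r}_x^{(PA)}(l)$ of $\tilde{R}_x^{(PA)}(k)$ is an aliased version of the autocorrelation estimate $\hat{r}_x^{(PA)}(l)$.
(b) If the length of the overlapping data segment in Welch’s method is L, how should $N_{FFT}$ be chosen to avoid aliasing in
$\tilde{r}_x^{(PA)}(l)$?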
4.16 Show that the coherence function $\mathcal{G}_{xy}^2(\omega)$ is invariant under linear transformation, that is, if $x_1(n) = h_1(n) * x(n)$ and
$y_1(n) = h_2(n) * y(n)$, then

$$\mathcal{G}_{x_1 y_1}^2(\omega) = \mathcal{G}_{xy}^2(\omega)$$
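4.17 Bartlett’s method is a special case of Welch’s method in which nonoverlapping sections of length L are used without windowing in
the periodogram averaging operation.
(a) Show that the ith periodogram in this method can be expressed as

$$\hat{R}_{x,i}(e^{j\omega}) = \sum_{l=-L+1}^{L-1} \check{r}_{x,i}(l)\, w_B(l)\, e^{-j\omega l}$$ (P.5)

where $w_B(l)$ is a (2L - 1)-length Bartlett window.
(b) Let $\mathbf{u}(e^{j\omega}) \triangleq [1 \;\; e^{j\omega} \;\; \cdots \;\; e^{j(L-1)\omega}]^T$. Show that $\hat{R}_{x,i}(e^{j\omega})$ in (P.5) can be expressed as a quadratic product

$$\hat{R}_{x,i}(e^{j\omega}) = \frac{1}{L}\, \mathbf{u}^H(e^{j\omega})\, \check{\mathbf{R}}_{x,i}\, \mathbf{u}(e^{j\omega})$$ (P.6)

where $\check{\mathbf{R}}_{x,i}$ is the autocorrelation matrix of the $\check{r}_{x,i}(l)$ values.
(c) Finally, show that the Bartlett estimate is given by

$$\hat{R}_x^{(B)}(e^{j\omega}) = \frac{1}{KL} \sum_{i=1}^{K} \mathbf{u}^H(e^{j\omega})\, \check{\mathbf{R}}_{x,i}\, \mathbf{u}(e^{j\omega})$$ (P.7)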
4.18 In this problem, we will explore a spectral estimation technique that uses combined data and correlation weighting (Carter and
Nuttall 1980). In this technique, the following steps are performed:
* Given $\{x(n)\}_{n=0}^{N-1}$, compute the Welch-Bartlett estimate $\hat{R}_x^{(PA)}(e^{j\omega})$ by choosing the appropriate values of L and D.
* Compute the autocorrelation estimate $\hat{r}_x^{(PA)}(l)$, $-L \le l \le L$, using the approach described in Problem 4.15.
* Window $\hat{r}_x^{(PA)}(l)$, using a lag window $w_a(l)$ to obtain $\hat{r}_x^{(CN)}(l) \triangleq \hat{r}_x^{(PA)}(l)\, w_a(l)$.
* Finally, compute the DTFT of $\hat{r}_x^{(CN)}(l)$ to obtain the new spectrum estimate $\hat{R}_x^{(CN)}(e^{j\omega})$.
(a) Determine the bias of $\hat{R}_x^{(CN)}(e^{j\omega})$.

(b) Comment on the effect of additional windowing on the variance and resolution of the estimate.

(c) Implement this technique in MATLAB, and compute spectral estimates of the process containing three sinusoids in white noise, 
which was discussed in the chapter. Experiment with various values of L and with different windows. Compare your results to 
those given for the Welch-Bartlett and Blackman-Tukey methods. 

4.19 Explain why we use the scaling factor

$$\sum_{n=0}^{L-1} w^2(n)$$

which is the energy of the data window in the Welch-Bartlett method.
4.20 Consider the basic periodogram estimator $\hat{R}_x(e^{j\omega})$ at the zero frequency, that is, at $\omega = 0$.

(a) Show that

$$\hat{R}_x(e^{j0}) = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n)\, e^{-j0} \right|^2 = \frac{1}{N} \left| \sum_{n=0}^{N-1} x(n) \right|^2$$

(b) If x(n) is a real-valued white Gaussian process with variance $\sigma_x^2$, determine the mean and variance of $\hat{R}_x(e^{j0})$.
(c) Determine if $\hat{R}_x(e^{j0})$ is a consistent estimator by evaluating the variance as $N \to \infty$.

4.21 Consider Bartlett’s method for estimating $R_x(e^{j0})$ using L = 1; that is, we use nonoverlapping segments of single samples. The
periodogram of one sample x(n) is simply $|x(n)|^2$. Thus we have

$$\hat{R}_x^{(B)}(e^{j0}) = \frac{1}{N} \sum_{n=0}^{N-1} \hat{R}_{x,n}(e^{j0}) = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2$$

Again assume that x(n) is a real-valued white Gaussian process with variance $\sigma_x^2$.
(a) Determine the mean and variance of $\hat{R}_x^{(B)}(e^{j0})$.
(b) Compare the above result with those in Problem 4.20. Comment on any differences.
4.22 One desirable property of lag or correlation windows is that their Fourier transforms are nonnegative.
(a) Formulate a procedure to generate a symmetric lag window of length 2L+1 with nonnegative Fourier transform. 
(b) Using the Hanning window as a prototype in the above procedure, determine and plot a 31-length lag window. Also plot its 
Fourier transform. 
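4.23 Consider the following random process

$$x(n) = \sum_{k=1}^{4} A_k \sin(\omega_k n + \phi_k) + \nu(n)$$

where

$$A_1 = 1 \qquad A_2 = 0.5 \qquad A_3 = 0.5 \qquad A_4 = 0.25$$
$$\omega_1 = 0.1\pi \qquad \omega_2 = 0.6\pi \qquad \omega_3 = 0.65\pi \qquad \omega_4 = 0.8\pi$$

and the phases $\{\phi_k\}_{k=1}^{4}$ are IID random variables uniformly distributed over $[-\pi, \pi]$. Generate 50 realizations of x(n) for
$0 \le n \le 256$.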
(a) Compute the Blackman-Tukey estimates for L=32, 64, and 128, using the Bartlett lag window. Plot your results, using overlay 
and averaged estimates. Comment on your plots. 
(b) Repeat part (a), using the Parzen window. 
(c) Provide a qualitative comparison between the above two sets of plots. 
4.24 Consider the random process given in Problem 4.23.
(a) Compute the Bartlett estimate, using L=16, 32, and 64. Plot your results, using overlay and averaged estimates. Comment on 
your plots. 
(b) Compute the Welch estimate, using 50 percent overlap, Hamming window, and L=16, 32, and 64. Plot your results, using 
overlay and averaged estimates. Comment on your plots. 
(c) Provide a qualitative comparison between the above two sets of plots. 
4.25 Consider the random process given in Problem 4.23.
(a) Compute the multitaper spectrum estimate, using K=3, 5, and 7 Slepian tapers. Plot your results, using overlay and averaged 
estimates. Comment on your plots. 
(b) Make a qualitative comparison between the above plots and those obtained in Problems 4.23 and 4.24. 
4.26 Generate 1000 samples of an AR(1) process using a = -0.9. Determine its theoretical PSD.
(a) Determine and plot the periodogram of the process along with the true spectrum. Comment on the plots.
(b) Compute the Blackman-Tukey estimates for L=10, 20, 50, and 100. Plot these estimates along with the true spectrum. Comment
on your results.
(c) Compute the Welch estimates for 50 percent overlap, Hamming window, and L=10, 20, 50, and 100. Plot these estimates along
with the true spectrum. Comment on your results.
4.27 Generate 1000 samples of an AR(1) process using a = 0.9 . Determine its theoretical PSD. 
(a) Determine and plot the periodogram of the process along with the true spectrum. Comment on the plots. 
(b) Compute the Blackman-Tukey estimates for L=10, 20, 50, and 100. Plot these estimates along with the true spectrum. Comment
on your results. 
(c) Compute the Welch estimates for 50 percent overlap, Hamming window, and L=10, 20, 50, and 100. Plot these estimates along 
with the true spectrum. Comment on your results. 
4.28 Multitaper estimation technique requires a properly designed orthonormal set of tapers for the desired performance. One set 
discussed in the chapter was that of harmonically related sinusoids given in (4.5.8). 
(a) Design a MATLAB function [tapers] = sine_tapers(N,K) that generates K < N sinusoidal tapers of length N.
(b) Using the above function, compute and plot the Fourier transform magnitudes of the first 5 tapers of length 51. 
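
Equation (4.5.8) is not reproduced in this section, so the taper formula in the following sketch is an assumption: the commonly used sine tapers $w_k(n) = \sqrt{2/(N+1)}\,\sin[\pi k n/(N+1)]$ for $n = 1, \ldots, N$ and $k = 1, \ldots, K$. The Python/NumPy function below mirrors the interface named in Problem 4.28 and returns the tapers as columns.

```python
import numpy as np

def sine_tapers(N, K):
    """Return an N-by-K array whose k-th column is the k-th sine taper (assumed form)."""
    if not 0 < K < N:
        raise ValueError("K must satisfy 0 < K < N")
    n = np.arange(1, N + 1)[:, None]     # sample index 1..N as a column
    k = np.arange(1, K + 1)[None, :]     # taper index 1..K as a row
    return np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * n / (N + 1))
```

With this form the columns are orthonormal, which is the property a multitaper estimator relies on; the Fourier transform magnitudes asked for in part (b) can then be obtained with a zero-padded `np.fft.fft` along the first axis.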
4.29 Design a MATLAB function Pxx = psd_sinetaper (x, K) that determines the multitaper estimates using the sine tapers. 
(a) Apply the function psd_sinetaper to the AR(1) process given in Problem 4.26, and compare its performance. 
(b) Apply the function psd_sinetaper to the AR(1) process given in Problem 4.27, and compare its performance. 
CHAPTER 5


Optimum Linear Filters 


In this chapter, we present the theory and application of optimum linear filters and predictors. We concentrate on 
linear filters that are optimum in the sense of minimizing the mean square error (MSE). The minimum MSE (MMSE) 
criterion leads to a theory of linear filtering that is elegant and simple, involves only second-order statistics, and is 
useful in many practical applications. The optimum filter designed for a given set of second-order moments can be 
used for any realizations of stochastic processes with the same moments. 

We start with the general theory of linear MMSE estimators and their computation. Then we apply the general 
theory to the design of optimum FIR filters and linear predictors for both nonstationary and stationary processes 
(Wiener filters). We continue with the design of nonparametric (impulse response) and parametric (pole-zero) 
optimum IIR filters and predictors for stationary processes. Then we present the design of optimum filters for inverse
system modeling, blind deconvolution, and their application to equalization of data communication channels. We 
conclude with a concise introduction to optimum matched filters and eigenfilters that maximize the output SNR. 
These signal processing methods find extensive applications in digital communication, radar, and sonar systems. 


5.1 Optimum Signal Estimation 


As we discussed in Chapter 1, the solution of many problems of practical interest depends on the ability to accurately 
estimate the value y(n) of a signal (desired response) by using a set of values (observations or data) from another 
related signal or signals. Successful estimation is possible if there is significant statistical dependence or correlation 
between the signals involved in the particular application. For example, in the linear prediction problem we use the M
past samples $x(n-1), x(n-2), \ldots, x(n-M)$ of a signal to estimate the current sample x(n). The echo canceler in
Figure 1.16 uses the transmitted signal to form a replica of the received echo. Although the signals in these and other 
similar applications have different physical origins, the mathematical formulations of the underlying signal 
processing problems are very similar. 

In array signal processing, the data are obtained by using M different sensors. The situation is simpler for 
filtering applications, because the data are obtained by delaying a single discrete-time signal; that is, we have 
$x_k(n) = x(n+1-k)$, $1 \le k \le M$ (see Figure 5.1). Further simplifications are possible in linear prediction, where
both the desired response and the data are time samples of the same signal, for example, $y(n) = x(n)$ and
$x_k(n) = x(n-k)$, $1 \le k \le M$. As a result, the design and implementation of optimum filters and predictors are
simpler than those for an optimum array processor. 

Since array processing problems are the most general ones, we will formulate and solve the following estimation 
problem: Given a set of data $x_k(n)$ for $1 \le k \le M$, determine an estimate $\hat{y}(n)$ of the desired response y(n),
using the rule (estimator)

$$\hat{y}(n) \triangleq H\{x_k(n),\ 1 \le k \le M\}$$ (5.1.1)

which, in general, is a nonlinear function of the data. When $x_k(n) = x(n+1-k)$, the estimator takes on the form of a
discrete-time filter that can be linear or nonlinear, time-invariant or time-varying, and with a finite- or 
infinite-duration impulse response. Linear filters can be implemented using any direct, parallel, cascade, or 
lattice-ladder structure. 

FIGURE 5.1
Illustration of the data vectors for (a) array processing (multiple sensors) and (b) FIR filtering or prediction (single sensor) applications.

The difference between the estimated response $\hat{y}(n)$ and the desired response y(n), that is,

$$e(n) \triangleq y(n) - \hat{y}(n)$$ (5.1.2)

is known as the error signal. We want to find an estimator whose output approximates the desired response as closely
as possible according to a certain performance criterion. We use the term optimum estimator or optimum signal 
processor to refer to such an estimator. We stress that optimum is not used as a synonym for best; it simply means the 
best under the given set of assumptions and conditions. If either the criterion of performance or the assumptions about 
the statistics of the processed signals change, the corresponding optimum filter will change as well. Therefore, an 
optimum estimator designed for a certain performance metric and set of assumptions may perform poorly according 
to some other criterion or if the actual statistics of the processed signals differ from the ones used in the design. For 
this reason, the sensitivity of the performance to deviations from the assumed statistics is very important in practical 
applications of optimum estimators. 
Therefore, the design of an optimum estimator involves the following steps: 
1. Selection of a computational structure with well-defined parameters for the implementation of the estimator. 
2. Selection of a criterion of performance or cost function that measures the performance of the estimator under some 
assumptions about the statistical properties of the signals to be processed. 
3. Optimization of the performance criterion to determine the parameters of the optimum estimator. 
4. Evaluation of the optimum value of the performance criterion to determine whether the optimum estimator 
satisfies the design specifications. 


Many practical applications (e.g., speech, audio, and image coding) require subjective criteria that are difficult to 
express mathematically. Thus, we focus on criteria of performance that (1) only depend on the estimation error e(n), 
(2) provide a sufficient measure of the user satisfaction, and (3) lead to a mathematically tractable problem. We 
generally select a criterion of performance by compromising between these objectives. 

Since, in most applications, negative and positive errors are equally harmful, we should choose a criterion that 
weights both negative and positive errors equally. Choices that satisfy this requirement include the absolute value of 
the error $|e(n)|$, or the squared error $|e(n)|^2$, or some other power of $|e(n)|$ (see Figure 5.2). The emphasis put
on different values of the error is a key factor when we choose a criterion of performance. For example, the 
squared-error criterion emphasizes the effect of large errors much more than the absolute error criterion. Thus, the 
squared-error criterion is more sensitive to outliers (occasional large values) than the absolute error criterion is. 
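
The sensitivity to outliers can be made concrete with a small numerical sketch. The error values below are hypothetical, chosen only so that one of them is much larger than the rest; the function name `outlier_share` is likewise an assumption made for this illustration.

```python
def outlier_share(errors, power):
    """Fraction of the total cost sum(|e|^power) contributed by the largest-magnitude error."""
    costs = [abs(e) ** power for e in errors]
    return max(costs) / sum(costs)

errors = [0.1, -0.2, 0.15, -0.1, 5.0]   # hypothetical errors; the last one is an outlier
share_abs = outlier_share(errors, 1)    # absolute-error weighting
share_sq = outlier_share(errors, 2)     # squared-error weighting
```

Both weightings treat positive and negative errors alike, but the single outlier accounts for a larger fraction of the squared-error cost than of the absolute-error cost.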

To develop a mathematical theory that will help to design and analyze the performance of optimum estimators, 
we assume that the desired response and the data are realizations of stochastic processes. Furthermore, although in 
practice the estimator operates on specific realizations of the input and desired response signals, we wish to design an 
estimator with good performance across all members of the ensemble, that is, an estimator that “works well on 
average.” Since, at any fixed time n, the quantities y(n), $x_k(n)$ for $1 \le k \le M$, and e(n) are random
variables, we should choose a criterion that involves the ensemble or time averaging of some function of $|e(n)|$.
Here is a short list of potential criteria of performance:
1. The mean square error criterion 


$$P(n) \triangleq E\{|e(n)|^2\}$$ (5.1.3)


which leads, in general, to a nonlinear optimum estimator. 
FIGURE 5.2
Graphical illustration of various error-weighting functions. 


2. The mean $\alpha$th-order error criterion $E\{|e(n)|^\alpha\}$, $\alpha \ne 2$. Using a lower- or higher-order moment of the absolute
error is more appropriate for certain types of non-Gaussian statistics than the MSE (Stuck 1978).
3. The sum of squared errors (SSE)

$$E(n_i, n_f) \triangleq \sum_{n=n_i}^{n_f} |e(n)|^2$$ (5.1.4)

which, if it is divided by $n_f - n_i + 1$, provides an estimate of the MSE.
The MSE criterion (5.1.3) and the SSE criterion (5.1.4) are the most widely used because they (1) are mathematically 
tractable, (2) lead to the design of useful systems for practical applications, and (3) can serve as a yardstick for 
evaluating estimators designed with other criteria (e.g., signal-to-noise ratio, maximum likelihood). In most practical 
applications, we use linear estimators, which further simplifies their design and evaluation. 

Mean square estimation is a rather vast field that was originally developed by Gauss in the nineteenth century. 
The current theories of estimation and optimum filtering started with the pioneering work of Wiener and Kolmogorov 
that was later extended by Kalman, Bucy, and others. Some interesting historical reviews are given in Kailath (1974) 
and Sorenson (1970). 


5.2 Linear Mean Square Error Estimation 


In this section, we develop the theory of linear MSE estimation. We concentrate on linear estimators for various 
reasons, including mathematical simplicity and ease of implementation. The problem can be stated as follows: 
Design an estimator that provides an estimate $\hat{y}(n)$ of the desired response y(n) using a linear
combination of the data $x_k(n)$ for $1 \le k \le M$, such that the MSE $E\{|y(n) - \hat{y}(n)|^2\}$ is minimized.
More specifically, the linear estimator is defined by

$$\hat{y}(n) \triangleq \sum_{k=1}^{M} c_k^*(n)\, x_k(n)$$ (5.2.1)

and the goal is to determine the coefficients $c_k(n)$ for $1 \le k \le M$ such that the MSE (5.1.3) is minimized. In
general, a new set of optimum coefficients should be computed for each time instant n. Since we assume that the
desired response and the data are realizations of stochastic processes, the quantities y(n), $x_1(n), \ldots, x_M(n)$ are
random variables at any fixed time n. For convenience, we formulate and solve the estimation problem at a fixed time
instant n. Thus, we drop the time index n and restate the problem as follows:
Estimate a random variable y (the desired response) from a set of related random variables
$x_1, x_2, \ldots, x_M$ (data) using the linear estimator

$$\hat{y} \triangleq \sum_{k=1}^{M} c_k^* x_k = \mathbf{c}^H \mathbf{x}$$ (5.2.2)

where

$$\mathbf{x} \triangleq [x_1 \;\; x_2 \;\; \cdots \;\; x_M]^T$$ (5.2.3)

is the input data vector and

$$\mathbf{c} \triangleq [c_1 \;\; c_2 \;\; \cdots \;\; c_M]^T$$ (5.2.4)

is the parameter or coefficient vector of the estimator.
Unless otherwise stated, all random variables are assumed to have zero-mean values. The number M of data 
components used is called the order of the estimator. The linear estimator (5.2.2) is represented graphically as shown 
in Figure 5.3 and involves a computational structure known as the linear combiner. The MSE 


$$P \triangleq E\{|e|^2\}$$ (5.2.5)

where

$$e \triangleq y - \hat{y}$$ (5.2.6)

is a function of the parameters $c_k$. Minimization of (5.2.5) with respect to parameters $c_k$ leads to a linear estimator $\mathbf{c}_o$
that is optimum in the MSE sense. The parameter vector $\mathbf{c}_o$ is known as the linear MMSE (LMMSE) estimator and $\hat{y}_o$
as the LMMSE estimate.


FIGURE 5.3
Block diagram representation of the linear estimator (data, estimator parameters, estimate, desired response, and error).


5.2.1 Error Performance Surface 


To determine the linear MMSE estimator, we seek the value of the parameter vector $\mathbf{c}$ that minimizes the function
(5.2.5). To this end, we want to express the MSE as a function of the parameter vector $\mathbf{c}$ and to understand the
nature of this dependence.

By using (5.2.5), (5.2.6), (5.2.2), and the linearity property of the expectation operator, the MSE is given by

$$P(\mathbf{c}) = E\{|e|^2\} = E\{(y - \mathbf{c}^H\mathbf{x})(y^* - \mathbf{x}^H\mathbf{c})\}$$
$$= E\{|y|^2\} - \mathbf{c}^H E\{\mathbf{x}y^*\} - E\{y\mathbf{x}^H\}\mathbf{c} + \mathbf{c}^H E\{\mathbf{x}\mathbf{x}^H\}\mathbf{c}$$

or more compactly,

$$P(\mathbf{c}) = P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c}$$ (5.2.7)

where

$$P_y \triangleq E\{|y|^2\}$$ (5.2.8)

is the power of the desired response,

$$\mathbf{d} \triangleq E\{\mathbf{x}y^*\}$$ (5.2.9)

is the cross-correlation vector between the data vector $\mathbf{x}$ and the desired response y, and

$$\mathbf{R} \triangleq E\{\mathbf{x}\mathbf{x}^H\}$$ (5.2.10)


is the correlation matrix of the data vector x. The matrix R is guaranteed to be Hermitian and nonnegative definite 
(see Section 2.2.4). 

The function P(c) is known as the error performance surface of the estimator. Equation (5.2.7) shows that the 
MSE P(c) (1) depends only on the second-order moments of the desired response and the data and (2) is a 
quadratic function of the estimator coefficients and represents an (M+1)-dimensional surface with M degrees of
freedom. We will see that if R is positive definite, then the quadratic function P(c) is bowl-shaped and has a unique 
minimum that corresponds to the optimum parameters. The next example illustrates this fact for the second-order 
case. 

FIGURE 5.4
Representative surface and contour plots for positive definite and indefinite quadratic error performance surfaces.

EXAMPLE 5.2.1. If M=2 and the random variables y, $x_1$, and $x_2$ are real-valued, the MSE is

$$P(c_1, c_2) = P_y - 2d_1c_1 - 2d_2c_2 + r_{11}c_1^2 + 2r_{12}c_1c_2 + r_{22}c_2^2$$

because $r_{12} = r_{21}$. Thus $P(c_1, c_2)$ is a second-order function of the coefficients $c_1$ and $c_2$. Figure 5.4 shows two plots of the
function $P(c_1, c_2)$ that are quite different in appearance. The surface in Figure 5.4(a) looks like a bowl and has a unique
extremum that is a minimum. The values for the error surface parameters are $P_y = 0.5$, $r_{11} = r_{22} = 4.5$, $r_{12} = r_{21} = -0.1545$, $d_1 = -0.5$,
and $d_2 = -0.1545$. On the other hand, in Figure 5.4(b), we have a saddle point that is neither a minimum nor a maximum (here
only the matrix elements have changed to $r_{11} = r_{22} = 1$, $r_{12} = r_{21} = 2$). If we cut the surfaces with planes parallel to the $(c_1, c_2)$
plane, we obtain contours of constant MSE that are shown in Figure 5.4(c) and (d). In conclusion, the error performance surface is
bowl-shaped and has a unique minimum only if the matrix R is positive definite (the determinants of the two matrices are 20.23
and -3, respectively). Only in this case can we obtain an estimator that minimizes the MSE, and the contours are concentric
ellipses whose center corresponds to the optimum estimator. The bottom of the bowl is determined by setting the partial derivatives
with respect to the unknown parameters to zero, that is,

$$\frac{\partial P(c_1, c_2)}{\partial c_1} = 0 \quad \text{which results in} \quad r_{11}c_{o,1} + r_{12}c_{o,2} = d_1$$
$$\frac{\partial P(c_1, c_2)}{\partial c_2} = 0 \quad \text{which results in} \quad r_{12}c_{o,1} + r_{22}c_{o,2} = d_2$$

This is a linear system of two equations with two unknowns whose solution provides the coefficients $c_{o,1}$ and $c_{o,2}$ that minimize
the MSE function $P(c_1, c_2)$.
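
The two error surfaces of Example 5.2.1 can be examined numerically. The following Python/NumPy sketch uses the parameter values quoted in the example; the function and variable names are assumptions made for this illustration.

```python
import numpy as np

def mse(c, P_y, d, R):
    """Real-valued error surface P(c) = P_y - 2 d^T c + c^T R c."""
    c = np.asarray(c, dtype=float)
    return P_y - 2.0 * d @ c + c @ R @ c

P_y = 0.5
d = np.array([-0.5, -0.1545])
R_bowl = np.array([[4.5, -0.1545], [-0.1545, 4.5]])   # values for Figure 5.4(a)
R_saddle = np.array([[1.0, 2.0], [2.0, 1.0]])         # values for Figure 5.4(b)

c_o = np.linalg.solve(R_bowl, d)          # normal equations: bottom of the bowl
c_s = np.linalg.solve(R_saddle, d)        # stationary point of the saddle surface
eig_bowl = np.linalg.eigvalsh(R_bowl)     # both positive: positive definite
eig_saddle, V = np.linalg.eigh(R_saddle)  # one negative, one positive: indefinite
```

Moving away from `c_s` along the eigenvector of the negative eigenvalue lowers the MSE, while moving along the other eigenvector raises it, which is the saddle behavior described in the example.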
When the optimum filter is specified by a rational system function, the error performance surface may be 
nonquadratic. This is illustrated in the following example. 


FIGURE 5.5
Identification of an “unknown” system using an optimum filter.

EXAMPLE 5.2.2. Suppose that we wish to estimate the real-valued output y(n) of the “unknown” system (see Figure 5.5)

$$G(z) = \frac{0.05 - 0.4z^{-1}}{1 - 1.1314z^{-1} + 0.25z^{-2}}$$

using the pole-zero filter

$$H(z) = \frac{b}{1 - az^{-1}}$$

by minimizing the MSE $E\{e^2(n)\}$ (Johnson and Larimore 1977). The input signal x(n) is white noise with zero mean and
variance $\sigma_x^2$. The MSE is given by

$$E\{e^2(n)\} = E\{[y(n) - \hat{y}(n)]^2\} = E\{y^2(n)\} - 2E\{y(n)\hat{y}(n)\} + E\{\hat{y}^2(n)\}$$

and is a function of parameters b and a. Since the impulse response $h(n) = ba^n u(n)$ of the optimum filter has infinite
duration, we cannot use (5.2.7) to compute $E\{e^2(n)\}$ and to plot the error surface. The three components of $E\{e^2(n)\}$ can be
evaluated as follows, using Parseval’s theorem: The power of the desired response

$$E\{y^2(n)\} = \sigma_x^2 \sum_{n=0}^{\infty} g^2(n) = \frac{\sigma_x^2}{2\pi j} \oint G(z)G(z^{-1})z^{-1}\,dz \triangleq \sigma_x^2 \sigma_g^2$$

is constant and can be computed either numerically by using the first M “nonzero” samples of g(n) or analytically by evaluating
the integral using the residue theorem. The power of the optimum filter output is

$$E\{\hat{y}^2(n)\} = E\left\{\left[\sum_{k=0}^{\infty} h(k)x(n-k)\right]^2\right\} = \frac{\sigma_x^2}{2\pi j} \oint H(z)H(z^{-1})z^{-1}\,dz = \sigma_x^2 \frac{b^2}{1 - a^2}$$

which is a function of parameters b and a. The middle term is

$$E\{y(n)\hat{y}(n)\} = E\left\{\sum_{k=0}^{\infty} g(k)x(n-k) \sum_{m=0}^{\infty} h(m)x(n-m)\right\} = \sigma_x^2 \sum_{k=0}^{\infty} g(k)h(k)$$

because $E\{x(n-k)x(n-m)\} = \sigma_x^2\delta(m-k)$. For convenience we compute the normalized MSE $P(b, a) \triangleq E\{e^2(n)\}/E\{y^2(n)\}$,
whose surface and contour plots are shown in Figure 5.6. We note that the resulting error performance surface is bimodal with a
global minimum P=0.277 at (b,a)=(-0.311, 0.906) and a local minimum P=0.976 at (b,a)=(0.114, -0.519). As a
result, the determination of the optimum filter requires the use of nonlinear optimization techniques with all associated drawbacks.
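
The normalized MSE of Example 5.2.2 can be evaluated numerically from the impulse response of G(z). The sketch below assumes the normalization $E\{e^2(n)\}/E\{y^2(n)\}$, truncates g(n) to a finite number of samples, and uses the closed-form energy $b^2/(1-a^2)$ of $h(n) = ba^n u(n)$; the function names are hypothetical.

```python
import numpy as np

def impulse_response_g(n_terms=600):
    """First n_terms samples of g(n) for G(z) = (0.05 - 0.4 z^-1) / (1 - 1.1314 z^-1 + 0.25 z^-2)."""
    g = np.zeros(n_terms)
    for n in range(n_terms):
        g[n] = 0.05 if n == 0 else (-0.4 if n == 1 else 0.0)   # numerator taps
        if n >= 1:
            g[n] += 1.1314 * g[n - 1]                          # recursive part
        if n >= 2:
            g[n] -= 0.25 * g[n - 2]
    return g

def normalized_mse(b, a, g):
    """E{e^2}/E{y^2} for H(z) = b / (1 - a z^-1) driven by white noise, |a| < 1."""
    n = np.arange(len(g))
    energy_g = np.sum(g ** 2)             # sum of g^2(n), truncated
    cross = b * np.sum(g * a ** n)        # sum of g(n) h(n) with h(n) = b a^n
    energy_h = b ** 2 / (1.0 - a ** 2)    # sum of h^2(n) in closed form
    return (energy_g - 2.0 * cross + energy_h) / energy_g

g = impulse_response_g()
P_global = normalized_mse(-0.311, 0.906, g)   # point reported as the global minimum
P_local = normalized_mse(0.114, -0.519, g)    # point reported as the local minimum
```

Evaluating `normalized_mse` on a grid of (b, a) values with |a| < 1 gives a surface with two separate basins, which is why a gradient search started in the wrong basin settles at the local minimum.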


FIGURE 5.6
Illustration of the nonquadratic form of the error performance surface of a pole-zero optimum filter specified by the coefficients of
its difference equation.


5.2.2 Derivation of the Linear MMSE Estimator 


The approach in Example 5.2.1 can be generalized to obtain the necessary and sufficient conditions that determine the 
linear MMSE estimator. Here, we present a simpler matrix-based approach that is sufficient for the scope of this 
chapter. 

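We first notice that we can put (5.2.7) into the form of a “perfect square” as

$$P(\mathbf{c}) = P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} + (\mathbf{R}\mathbf{c} - \mathbf{d})^H \mathbf{R}^{-1} (\mathbf{R}\mathbf{c} - \mathbf{d})$$ (5.2.11)

where only the third term depends on $\mathbf{c}$. If R is positive definite, the inverse matrix $\mathbf{R}^{-1}$ exists and is positive
definite; that is, $\mathbf{z}^H\mathbf{R}^{-1}\mathbf{z} > 0$ for all $\mathbf{z} \ne \mathbf{0}$. Therefore, if R is positive definite, the term $\mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} > 0$ decreases the
cost function by an amount determined exclusively by the second-order moments. In contrast, the term
$(\mathbf{R}\mathbf{c} - \mathbf{d})^H\mathbf{R}^{-1}(\mathbf{R}\mathbf{c} - \mathbf{d}) \ge 0$ increases the cost function depending on the choice of the estimator parameters. Thus,
the best estimator is obtained by setting $\mathbf{R}\mathbf{c} - \mathbf{d} = \mathbf{0}$.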
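
As a check on the perfect-square form, the quadratic term can be expanded using only the Hermitian symmetry of R. The following worked step is a sketch added for illustration and uses the symbols of (5.2.7).

```latex
\begin{aligned}
(\mathbf{R}\mathbf{c}-\mathbf{d})^H \mathbf{R}^{-1} (\mathbf{R}\mathbf{c}-\mathbf{d})
 &= (\mathbf{c}^H\mathbf{R} - \mathbf{d}^H)\,\mathbf{R}^{-1}\,(\mathbf{R}\mathbf{c}-\mathbf{d})
    && \text{since } \mathbf{R}^H = \mathbf{R} \\
 &= \mathbf{c}^H\mathbf{R}\mathbf{c} - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c}
    + \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d}
\end{aligned}
```

Adding $P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d}$ cancels the last term and leaves $P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c}$, which is (5.2.7).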

Therefore, the necessary and sufficient conditions that determine the linear MMSE estimator $\mathbf{c}_o$ are

$$\mathbf{R}\mathbf{c}_o = \mathbf{d}$$ (5.2.12)

and

R is positive definite (5.2.13)

In greater detail, (5.2.12) can be written as

$$\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1M} \\ r_{21} & r_{22} & \cdots & r_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ r_{M1} & r_{M2} & \cdots & r_{MM} \end{bmatrix} \begin{bmatrix} c_{o,1} \\ c_{o,2} \\ \vdots \\ c_{o,M} \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{bmatrix}$$ (5.2.14)

where

$$r_{ij} \triangleq E\{x_i x_j^*\} = r_{ji}^*$$ (5.2.15)

and

$$d_i \triangleq E\{x_i y^*\}$$ (5.2.16)

and are known as the set of normal equations. The invertibility of the correlation matrix R, and hence the existence
of the optimum estimator, is guaranteed if R is positive definite. In theory, R is guaranteed to be nonnegative
definite, but in physical applications it will almost always be positive definite. The normal equations can be solved by
using any general purpose routine for a set of linear equations.

† For complex-valued random variables, there are some complications that should be taken into account because $|e|^2$ is not an analytic
function.
Using (5.2.11) and (5.2.12), we find that the MMSE $P_o$ is

$$P_o = P_y - \mathbf{d}^H\mathbf{R}^{-1}\mathbf{d} = P_y - \mathbf{d}^H\mathbf{c}_o$$ (5.2.17)

where we can easily show that the term $\mathbf{d}^H\mathbf{c}_o$ is equal to $E\{|\hat{y}_o|^2\}$, the power of the optimum estimate. If $\mathbf{x}$ and
y are uncorrelated ($\mathbf{d} = \mathbf{0}$), we have the worst situation ($P_o = P_y$) because there is no linear estimator that can
reduce the MSE. If $\mathbf{d} \ne \mathbf{0}$, there is always going to be some reduction in the MSE owing to the correlation between
the data vector $\mathbf{x}$ and the desired response y, assuming that R is positive definite. The best situation corresponds
to $\hat{y} = y$, which gives $P_o = 0$. Thus, for comparison purposes, we use the normalized MSE

$$\mathcal{E} \triangleq \frac{P_o}{P_y} = 1 - \frac{P_{\hat{y}_o}}{P_y}$$ (5.2.18)

because it is bounded between 0 and 1, that is,

$$0 \le \mathcal{E} \le 1$$ (5.2.19)


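If $\tilde{\mathbf{c}}$ is the deviation from the optimum vector $\mathbf{c}_o$, that is, if $\mathbf{c} = \mathbf{c}_o + \tilde{\mathbf{c}}$, then substituting into (5.2.11) and using
(5.2.17), we obtain

$$P(\mathbf{c}_o + \tilde{\mathbf{c}}) = P(\mathbf{c}_o) + \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}}$$ (5.2.20)

Equation (5.2.20) shows that if R is positive definite, any deviation $\tilde{\mathbf{c}}$ from the optimum vector $\mathbf{c}_o$ increases the
MSE by an amount $\tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}} > 0$, which is known as the excess MSE, that is,

$$\text{Excess MSE} \triangleq P(\mathbf{c}_o + \tilde{\mathbf{c}}) - P(\mathbf{c}_o) = \tilde{\mathbf{c}}^H\mathbf{R}\tilde{\mathbf{c}}$$ (5.2.21)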


We emphasize that the excess MSE depends only on the input correlation matrix and not on the desired response. 
This fact has important implications because any deviation from the optimum can be detected by monitoring the 
MSE. 
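
The normal equations, the MMSE, and the excess MSE can be illustrated with a synthetic complex-valued example. In the Python/NumPy sketch below, the correlation matrix is randomly generated and the desired response is assumed to follow the hypothetical model $y = \mathbf{c}_{true}^H\mathbf{x} + v$ with v uncorrelated with $\mathbf{x}$; all variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)            # Hermitian, positive definite correlation matrix
c_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
sigma_v2 = 0.1                            # power of the part of y uncorrelated with x
d = R @ c_true                            # d = E{x y*} under the assumed model
P_y = np.real(np.vdot(c_true, R @ c_true)) + sigma_v2

def mse(c):
    """Error surface (5.2.7): P(c) = P_y - c^H d - d^H c + c^H R c."""
    return np.real(P_y - np.vdot(c, d) - np.vdot(d, c) + np.vdot(c, R @ c))

c_o = np.linalg.solve(R, d)               # normal equations (5.2.12)
P_o = P_y - np.real(np.vdot(d, c_o))      # MMSE (5.2.17)
nmse = P_o / P_y                          # normalized MSE (5.2.18)

c_dev = 0.05 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
excess = np.real(np.vdot(c_dev, R @ c_dev))   # excess MSE (5.2.21)
```

Under this model the solver recovers `c_true` and the MMSE equals `sigma_v2`, while `mse(c_o + c_dev) - mse(c_o)` matches `excess` without any reference to `d` or `P_y`.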

For nonzero-mean random variables, we use the estimator $\hat{y} \triangleq c_0 + \mathbf{c}^H\mathbf{x}$. The elements of R and d are replaced
by the corresponding covariances and $c_0 = E\{y\} - \mathbf{c}^H E\{\mathbf{x}\}$ (see Problem 5.1). In the sequel, unless otherwise
explicitly stated, we assume that all random variables have zero mean or have been reduced to zero mean by 
replacing y by y—E{y} and x by x—E{x}. 


5.2.3 Principal-Component Analysis of the Optimum Linear Estimator 


The properties of optimum linear estimators and their error performance surfaces depend on the correlation matrix R. 
We can learn a lot about the nature of the optimum estimator if we express R in terms of its eigenvalues and 
eigenvectors. Indeed, from Section 2.2.4, we have 


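$$\mathbf{R} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^H = \sum_{i=1}^{M} \lambda_i \mathbf{q}_i \mathbf{q}_i^H \quad \text{and} \quad \boldsymbol{\Lambda} = \mathbf{Q}^H \mathbf{R} \mathbf{Q} \qquad (5.2.22)$$
where
$$\boldsymbol{\Lambda} = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_M\} \qquad (5.2.23)$$
are the eigenvalues of $\mathbf{R}$, assumed to be distinct, and
$$\mathbf{Q} = [\mathbf{q}_1 \;\; \mathbf{q}_2 \;\; \cdots \;\; \mathbf{q}_M] \qquad (5.2.24)$$
are the eigenvectors of $\mathbf{R}$. The modal matrix $\mathbf{Q}$ is unitary, that is,
$$\mathbf{Q}^H \mathbf{Q} = \mathbf{I} \qquad (5.2.25)$$
which implies that $\mathbf{Q}^{-1} = \mathbf{Q}^H$. The relationship (5.2.22) between $\mathbf{R}$ and $\boldsymbol{\Lambda}$ is known as a similarity transformation.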
In general, the multiplication of a vector by a matrix changes both the length and the direction of the vector. We 
define a coordinate transformation of the optimum parameter vector by 


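$$\mathbf{c}'_o \triangleq \mathbf{Q}^H \mathbf{c}_o \quad \text{or} \quad \mathbf{c}_o \triangleq \mathbf{Q} \mathbf{c}'_o \qquad (5.2.26)$$
Since
$$\|\mathbf{c}'_o\|^2 = (\mathbf{Q}^H \mathbf{c}_o)^H \mathbf{Q}^H \mathbf{c}_o = \mathbf{c}_o^H \mathbf{Q} \mathbf{Q}^H \mathbf{c}_o = \|\mathbf{c}_o\|^2 \qquad (5.2.27)$$
the transformation (5.2.26) changes the direction of the transformed vector but not its length.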
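If we substitute (5.2.22) into the normal equations (5.2.12), we obtain
$$\mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^H \mathbf{c}_o = \mathbf{d} \quad \text{or} \quad \boldsymbol{\Lambda} \mathbf{Q}^H \mathbf{c}_o = \mathbf{Q}^H \mathbf{d}$$
which results in
$$\boldsymbol{\Lambda} \mathbf{c}'_o = \mathbf{d}' \qquad (5.2.28)$$
where
$$\mathbf{d}' \triangleq \mathbf{Q}^H \mathbf{d} \quad \text{or} \quad \mathbf{d} \triangleq \mathbf{Q} \mathbf{d}' \qquad (5.2.29)$$
is the transformed “decoupled” cross-correlation vector.
Because $\boldsymbol{\Lambda}$ is diagonal, the set of $M$ equations (5.2.28) can be written as
$$\lambda_i c'_{o,i} = d'_i \qquad 1 \le i \le M \qquad (5.2.30)$$

where $c'_{o,i}$ and $d'_i$ are the components of $\mathbf{c}'_o$ and $\mathbf{d}'$, respectively. This is an uncoupled set of $M$ first-order
equations. If $\lambda_i \neq 0$, then

$$c'_{o,i} = \frac{d'_i}{\lambda_i} \qquad 1 \le i \le M \qquad (5.2.31)$$

and if $\lambda_i = 0$, the value of $c'_{o,i}$ is indeterminate.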
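The MMSE becomes
$$\begin{aligned}
P_o &= P_y - \mathbf{d}^H \mathbf{c}_o \\
    &= P_y - (\mathbf{Q}\mathbf{d}')^H \mathbf{Q} \mathbf{c}'_o = P_y - \mathbf{d}'^H \mathbf{c}'_o \\
    &= P_y - \sum_{i=1}^{M} d'^*_i c'_{o,i} = P_y - \sum_{i=1}^{M} \frac{|d'_i|^2}{\lambda_i}
\end{aligned} \qquad (5.2.32)$$
which shows how the eigenvalues and the decoupled cross-correlations affect the performance of the optimum filter.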


The advantage of (5.2.31) and (5.2.32) is that we can study the behavior of each parameter of the optimum estimator 
independently of all the remaining ones. 
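
The decoupling in (5.2.30) to (5.2.32) can be illustrated for a real symmetric second-order correlation matrix, whose eigenvectors follow from a single rotation angle. This is only a sketch: the numerical values and the helper name eig_sym2 are assumptions, and the closed-form rotation used here applies to the real 2×2 case only.

```python
import math

# Illustrative values only (assumed, not taken from the text).
R = [[1.0, 0.5],
     [0.5, 1.0]]
d = [0.5, 0.25]
Py = 1.0


def eig_sym2(R):
    """Eigenvalues and orthonormal eigenvectors of a real symmetric 2x2 matrix."""
    a, b, c = R[0][0], R[0][1], R[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)      # rotation that diagonalizes R
    q1 = [math.cos(theta), math.sin(theta)]
    q2 = [-math.sin(theta), math.cos(theta)]
    lam = [sum(q[i] * R[i][j] * q[j] for i in range(2) for j in range(2))
           for q in (q1, q2)]                      # lambda_i = q_i^T R q_i
    return lam, [q1, q2]


lam, q = eig_sym2(R)
d_prime = [q[i][0] * d[0] + q[i][1] * d[1] for i in range(2)]          # d' = Q^T d
c_prime = [d_prime[i] / lam[i] for i in range(2)]                      # (5.2.31)
c_o = [sum(c_prime[i] * q[i][k] for i in range(2)) for k in range(2)]  # c_o = Q c'
P_o = Py - sum(d_prime[i] ** 2 / lam[i] for i in range(2))             # (5.2.32)
```

Each transformed coefficient is obtained from its own eigenvalue and decoupled cross-correlation, and rotating back with Q recovers the same optimum vector that a direct solution of the normal equations would give.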





FIGURE 5.7
Contours of constant MSE and principal-component axes for a second-order quadratic error surface.


To appreciate the significance of the principal-component transformation, we will discuss the error surface of a 
second-order estimator. However, all the results can be easily generalized to estimators of order M, whose error 
performance surface exists in a space of M+1 dimensions. Figure 5.7 shows the contours of constant MSE for a 
positive definite, second-order error surface. The contours are concentric ellipses centered at the tip of the optimum
vector $\mathbf{c}_o$. We define a new coordinate system with origin at $\mathbf{c}_o$ and axes determined by the major axis $\tilde{v}_1$ and
the minor axis $\tilde{v}_2$ of the ellipses. The two axes are orthogonal, and the resulting system is known as the principal
coordinate system. The transformation from the “old” system to the “new” system is done in two steps:

$$\begin{aligned}
\text{Translation:} \quad & \tilde{\mathbf{c}} = \mathbf{c} - \mathbf{c}_o \\
\text{Rotation:} \quad & \tilde{\mathbf{v}} = \mathbf{Q}^H \tilde{\mathbf{c}}
\end{aligned} \qquad (5.2.33)$$


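where the rotation changes the axes of the space to match the axes of the ellipsoid. The excess MSE (5.2.21) becomes

$$\Delta P(\tilde{\mathbf{v}}) = \tilde{\mathbf{c}}^H \mathbf{R}\, \tilde{\mathbf{c}} = \tilde{\mathbf{c}}^H \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^H \tilde{\mathbf{c}} = \tilde{\mathbf{v}}^H \boldsymbol{\Lambda} \tilde{\mathbf{v}} = \sum_{i=1}^{M} \lambda_i |\tilde{v}_i|^2 \qquad (5.2.34)$$

which shows that the penalty paid for the deviation of a parameter from its optimum value is proportional to the
corresponding eigenvalue. Clearly, changes in uncoupled parameters (which correspond to $\lambda_i = 0$) do not affect the
excess MSE.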

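Using (5.2.22), we have

$$\mathbf{c}_o = \mathbf{R}^{-1} \mathbf{d} = \mathbf{Q} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^H \mathbf{d} = \sum_{i=1}^{M} \frac{\mathbf{q}_i^H \mathbf{d}}{\lambda_i}\, \mathbf{q}_i = \sum_{i=1}^{M} \frac{d'_i}{\lambda_i}\, \mathbf{q}_i \qquad (5.2.35)$$
and the optimum estimate can be written as
$$\hat{y}_o = \mathbf{c}_o^H \mathbf{x} = \sum_{i=1}^{M} \frac{d'^*_i}{\lambda_i}\, (\mathbf{q}_i^H \mathbf{x}) \qquad (5.2.36)$$

which leads to the representation of the optimum estimator shown in Figure 5.8. The eigenfilters $\mathbf{q}_i$ decorrelate the
data vector $\mathbf{x}$ into its principal components, which are weighted and added to produce the optimum estimate.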






FIGURE 5.8
Principal-components representation of the optimum linear estimator.


5.2.4 Geometric Interpretations and the Principle of Orthogonality 


It is convenient and pedagogic to think of random variables with zero mean value and finite variance as vectors in an
abstract vector space with an inner product (i.e., a Hilbert space) defined by their correlation

$$\langle x, y \rangle \triangleq E\{x y^*\} \qquad (5.2.37)$$
and the length of a vector by
$$\|x\|^2 \triangleq \langle x, x \rangle = E\{|x|^2\} < \infty \qquad (5.2.38)$$
From the definition of the correlation coefficient and the above definitions, we obtain
$$|\langle x, y \rangle| \le \|x\|\, \|y\| \qquad (5.2.39)$$
which is known as the Cauchy-Schwarz inequality. Two random variables are orthogonal, denoted by $x \perp y$, if
$$\langle x, y \rangle = E\{x y^*\} = 0 \qquad (5.2.40)$$

which implies they are uncorrelated since they have zero mean.
This geometric viewpoint offers an illuminating and intuitive interpretation for many aspects of MSE estimation
that we will find very useful. Indeed, using (5.2.9), (5.2.10), and (5.2.12), we have
$$E\{\mathbf{x} e_o^*\} = E\{\mathbf{x}(y^* - \mathbf{x}^H \mathbf{c}_o)\} = E\{\mathbf{x} y^*\} - E\{\mathbf{x}\mathbf{x}^H\}\mathbf{c}_o = \mathbf{d} - \mathbf{R}\mathbf{c}_o = \mathbf{0}$$
Therefore
$$E\{\mathbf{x} e_o^*\} = \mathbf{0} \qquad (5.2.41)$$
or
$$E\{x_m e_o^*\} = 0 \qquad \text{for } 1 \le m \le M \qquad (5.2.42)$$

that is, the estimation error is orthogonal to the data used for the estimation. Equations (5.2.41), or equivalently
(5.2.42), are known as the orthogonality principle and are widely used in linear MMSE estimation.

To illustrate the use of the orthogonality principle, we note that any linear combination $c_1^* x_1 + \cdots + c_M^* x_M$ lies
in the subspace defined by the vectors² $x_1, \ldots, x_M$. Therefore, the estimate $\hat{y}$ that minimizes the squared length of
the error vector $e$, that is, the MSE, is determined by the foot of the perpendicular from the tip of the vector $y$ to
the “plane” defined by vectors $x_1, \ldots, x_M$. This is illustrated in Figure 5.9 for $M = 2$. Since $e_o$ is perpendicular to
every vector in the plane, we have $x_m \perp e_o$, $1 \le m \le M$, which leads to the orthogonality principle (5.2.42).
Conversely, we can start with the orthogonality principle (5.2.41) and derive the normal equations. This interpretation
has led to the name normal equations for (5.2.12). We will see several times that the concept of orthogonality has
many important theoretical and practical implications. As an illustration, we apply the Pythagorean theorem to the
right triangle formed by vectors $\hat{y}_o$, $e_o$, and $y$ in Figure 5.9, to obtain

$$\|y\|^2 = \|\hat{y}_o\|^2 + \|e_o\|^2$$
or
$$E\{|y|^2\} = E\{|\hat{y}_o|^2\} + E\{|e_o|^2\} \qquad (5.2.43)$$

which decomposes the power of the desired response into two components, one that is correlated to the data and one
that is uncorrelated to the data.

FIGURE 5.9
Pictorial illustration of the orthogonality principle.


5.2.5 Summary and Further Properties 


We next summarize, for emphasis and future reference, some important properties of optimum, in the MMSE sense,
linear estimators.

1. Equations (5.2.12) and (5.2.17) show that the optimum estimator and the MMSE depend only on the second-order 
moments of the desired response and the data. The dependence on the second-order moments is a consequence of 
both the linearity of the estimator and the use of the MSE criterion. 

2. The error performance surface of the optimum estimator is a quadratic function of its coefficients. If the data
correlation matrix is positive definite, this function has a unique minimum that determines the optimum set of
coefficients. The surface can be visualized as a bowl, and the optimum estimator corresponds to the bottom of the
bowl.

² We should be careful to avoid confusing vector random variables, that is, vectors whose components are random variables, and random
variables interpreted as vectors in the abstract vector space defined by Equations (5.2.37) to (5.2.39).

3. If the data correlation matrix $\mathbf{R}$ is positive definite, any deviation from the optimum increases the MSE according
to (5.2.21). The resulting excess MSE depends on $\mathbf{R}$ only. This property is very useful in the design of adaptive
filters.

4. When the estimator operates with the optimum set of coefficients, the error $e_o$ is uncorrelated (orthogonal) to
both the data $x_1, x_2, \ldots, x_M$ and the optimum estimate $\hat{y}_o$. This property is very useful if we want to monitor the
performance of an optimum estimator in practice and is used also to design adaptive filters.

5. The MMSE, the optimum estimator, and the optimum estimate can be expressed in terms of the eigenvalues and 
eigenvectors of the data correlation matrix. See (5.2.32), (5.2.35), and (5.2.36). 

6. The general (unconstrained) estimator

$$\hat{y} \triangleq h(\mathbf{x}) = h(x_1, x_2, \ldots, x_M)$$

that minimizes the MSE
$$P = E\{|y - h(\mathbf{x})|^2\}$$
with respect to $h(\mathbf{x})$ is given by the mean of the conditional density, that is,

$$\hat{y}_o \triangleq h_o(\mathbf{x}) = E\{y \mid \mathbf{x}\} = \int_{-\infty}^{\infty} y\, p_y(y \mid \mathbf{x})\, dy$$

and is, in general, a nonlinear function of $x_1, \ldots, x_M$. If the desired response and the data are jointly Gaussian, the
linear MMSE estimator is the best in the MMSE sense; that is, we cannot find a nonlinear estimator that
produces an estimate with smaller MMSE (Papoulis 1991).


5.3 Optimum Finite Impulse Response Filters 


In the previous section, we presented the theory of general linear MMSE estimators [see Figure 5.1(a)]. In this section, 
we apply these results to the design of optimum linear filters, that is, filters whose performance is the best possible 
when measured according to the MMSE criterion [see Figure 5.1(b)]. The general formulation of the optimum 
filtering problem is shown in Figure 5.10. The optimum filter forms an estimate $\hat{y}(n)$ of the desired response $y(n)$
by using samples from a related input signal $x(n)$. The theory of optimum filters was developed by Wiener (1942) in
continuous time and Kolmogorov (1939) in discrete time. Levinson (1947) reformulated the theory for FIR filters and 
stationary processes and developed an efficient algorithm for the solution of the normal equations that exploits the 
Toeplitz structure of the autocorrelation matrix R (see Section 6.4). For this reason, linear MMSE filters are often 
referred to as Wiener filters. 


FIGURE 5.10
Block diagram representation of the optimum filtering problem.

We consider a linear FIR filter specified by its impulse response $h(n,k)$. The output of the filter is determined
by the superposition summation

$$\hat{y}(n) = \sum_{k=0}^{M-1} h(n,k)\, x(n-k) \qquad (5.3.1)$$
$$\phantom{\hat{y}(n)} \triangleq \sum_{k=1}^{M} c_k^*(n)\, x(n-k+1) \triangleq \mathbf{c}^H(n)\, \mathbf{x}(n) \qquad (5.3.2)$$
where
$$\mathbf{c}(n) \triangleq [c_1(n) \;\; c_2(n) \;\; \cdots \;\; c_M(n)]^T \qquad (5.3.3)$$
and
$$\mathbf{x}(n) \triangleq [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-M+1)]^T \qquad (5.3.4)$$
are the filter coefficient vector³ and the input data vector, respectively. Equation (5.3.1) becomes a convolution if
$h(n,k)$ does not depend on $n$, that is, when the filter is time-invariant. The objective is to find the coefficient
vector that minimizes the MSE $E\{|e(n)|^2\}$.

We prefer FIR over IIR filters because (1) any stable IIR filter can be approximated to any desirable degree by 
an FIR filter and (2) optimum FIR filters are easily obtained by solving a linear system of equations. 


5.3.1 Design and Properties 


To determine the optimum FIR filter $\mathbf{c}_o(n)$, we note that at every time instant $n$, the optimum filter is the linear
MMSE estimator of the desired response $y(n)$ based on the data $\mathbf{x}(n)$. Since for any fixed $n$ the quantities
$y(n), x(n), \ldots, x(n-M+1)$ are random variables, we can determine the optimum filter either from (5.2.12) by
replacing $\mathbf{x}$ by $\mathbf{x}(n)$, $y$ by $y(n)$, and $\mathbf{c}_o$ by $\mathbf{c}_o(n)$; or by applying the orthogonality principle (5.2.41).
Indeed, using (5.2.41), (5.1.2), and (5.3.2), we have

$$E\{\mathbf{x}(n)[y^*(n) - \mathbf{x}^H(n)\, \mathbf{c}_o(n)]\} = \mathbf{0} \qquad (5.3.5)$$
which leads to the following set of normal equations

$$\mathbf{R}(n)\, \mathbf{c}_o(n) = \mathbf{d}(n) \qquad (5.3.6)$$

where
$$\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\, \mathbf{x}^H(n)\} \qquad (5.3.7)$$

is the correlation matrix of the input data vector and
$$\mathbf{d}(n) \triangleq E\{\mathbf{x}(n)\, y^*(n)\} \qquad (5.3.8)$$

is the cross-correlation vector between the desired response and the input data vector, that is, the input values stored
currently in the filter memory and used by the filter to estimate the desired response. We see that, at every time $n$, the
coefficients of the optimum filter are obtained as the solution of a linear system of equations. The filter $\mathbf{c}_o(n)$ is
optimum if and only if the Hermitian matrix $\mathbf{R}(n)$ is positive definite.

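To find the MMSE, we can use either (5.2.17) or the orthogonality principle (5.2.41). Using the orthogonality
principle, we have

$$\begin{aligned}
P_o(n) &= E\{e_o(n)[y^*(n) - \mathbf{x}^H(n)\, \mathbf{c}_o(n)]\} \\
       &= E\{e_o(n)\, y^*(n)\} \qquad \text{due to orthogonality} \\
       &= E\{[y(n) - \mathbf{c}_o^H(n)\, \mathbf{x}(n)]\, y^*(n)\}
\end{aligned}$$
which, since $\mathbf{c}_o^H(n)\, \mathbf{d}(n) = \mathbf{d}^H(n)\, \mathbf{c}_o(n)$ is real, can be written as
$$P_o(n) = P_y(n) - \mathbf{d}^H(n)\, \mathbf{c}_o(n) \qquad (5.3.9)$$
The first term
$$P_y(n) \triangleq E\{|y(n)|^2\} \qquad (5.3.10)$$

is the power of the desired response signal and represents the MSE in the absence of filtering. The second term
$\mathbf{d}^H(n)\, \mathbf{c}_o(n)$ is the reduction in the MSE that is obtained by using the optimum filter.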

In many practical applications, we need to know the performance of the optimum filter in terms of MSE 
reduction prior to computing the coefficients of the filter. Then we can decide if it is preferable to (1) use an optimum 
filter (assuming we can design one), (2) use a simpler suboptimum filter with adequate performance, or (3) not use a 
filter at all. Hence, the performance of the optimum filter can serve as a yardstick for other competing methods. 

The optimum filter consists of (1) a linear system solver that determines the optimum set of coefficients from the
normal equations formed, using the known second-order moments, and (2) a discrete-time filter that computes the estimate
$\hat{y}(n)$ (see Figure 5.11). The solution of (5.3.6) can be obtained by using standard linear system solution techniques. In
MATLAB, we solve (5.3.6) by copt=R\d and compute the MMSE by Popt=Py-real(dot(d,copt)). The
optimum filter is implemented by yest=filter(conj(copt),1,x). We emphasize that the optimum filter only needs
the input signal for its operation, that is, to form the estimate of $y(n)$; the desired response, if it is available, may be
used for other purposes.

³ We define $c_{k+1}(n) \triangleq h^*(n,k)$, $0 \le k \le M-1$, in order to comply with the definition $\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\, \mathbf{x}^H(n)\}$ of the correlation matrix.


FIGURE 5.11
Design and implementation of a time-varying optimum FIR filter.

Conventional frequency-selective filters are designed to shape the spectrum of the input signal within a specific 
frequency band in which it operates. In this sense, these filters are effective only if the components of interest in the 
input signal have their energy concentrated within nonoverlapping bands. To design the filters, we need to know the 
limits of these bands, not the values of the sequences to be filtered. Note that such filters do not depend on the values 
of the data (values of the samples) to be filtered; that is, they are not data-adaptive. In contrast, optimum filters are 
designed using the second-order moments of the processed signals and have the same effect on all classes of signals 
with the same second-order moments. Optimum filters are effective even if the signals of interest have overlapping 
spectra. Although the actual data values also do not affect optimum filters, that is, they are also not data-adaptive, 
these filters are optimized to the statistics of the data and thus provide superior performance when judged by the 
statistical criterion.

The dependence of the optimum filter only on the second-order moments is a consequence of the linearity of the 
filter and the use of the MSE criterion. Phase information about the input signal or non-second-order moments of the 
input and desired response processes is not needed; even if the moments are known, they are not used by the filter. 
Such information is useful only if we employ a nonlinear filter or use another criterion of performance. 

The error performance surface of the optimum direct-form FIR filter is a quadratic function of its impulse 
response. If the input correlation matrix is positive definite, this function has a unique minimum that determines the 
optimum set of coefficients. The surface can be visualized as a bowl, and the optimum filter corresponds to the 
bottom of the bowl. The bottom is moving if the processes are nonstationary and fixed if they are stationary. In 
general, the shape of the error performance surface depends on the criterion of performance and the structure of the 
filter. Note that the use of another criterion of performance or another filter structure may lead to error performance 
surfaces with multiple local minima or saddle points. 


5.3.2 Optimum FIR Filters for Stationary Processes 


Further simplifications and additional insight into the operation of optimum linear filters are possible when the input 
and desired response stochastic processes are jointly wide-sense stationary. In this case, the correlation matrix of the 
input data and the cross-correlation vector do not depend on the time index n. Therefore, the optimum filter and the 
MMSE are time-invariant (i.e., they are independent of the time index n ) and are determined by 

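$$\mathbf{R}\, \mathbf{c}_o = \mathbf{d} \qquad (5.3.11)$$

and
$$P_o = P_y - \mathbf{d}^H \mathbf{c}_o \qquad (5.3.12)$$
Owing to stationarity, the autocorrelation matrix is

$$\mathbf{R} \triangleq \begin{bmatrix}
r_x(0) & r_x(1) & \cdots & r_x(M-1) \\
r_x^*(1) & r_x(0) & \cdots & r_x(M-2) \\
\vdots & \vdots & \ddots & \vdots \\
r_x^*(M-1) & r_x^*(M-2) & \cdots & r_x(0)
\end{bmatrix} \qquad (5.3.13)$$

determined by the autocorrelation $r_x(l) = E\{x(n)\, x^*(n-l)\}$ of the input signal. The cross-correlation vector between
the desired response and the input data vector is

$$\mathbf{d} \triangleq [d_1 \;\; d_2 \;\; \cdots \;\; d_M]^T \triangleq [r_{yx}^*(0) \;\; r_{yx}^*(1) \;\; \cdots \;\; r_{yx}^*(M-1)]^T \qquad (5.3.14)$$

and $P_y$ is the power of the desired response. For stationary processes, the matrix $\mathbf{R}$ is Toeplitz and positive definite
unless the components of the data vector are linearly dependent.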
Since the optimum filter is time-invariant, it is implemented by using convolution

$$\hat{y}_o(n) = \sum_{k=0}^{M-1} h_o(k)\, x(n-k) \qquad (5.3.15)$$

where $h_o(n) = c^*_{o,n+1}$ is the impulse response of the optimum filter.
Using (5.3.13), (5.3.14), $h_o(n) = c^*_{o,n+1}$, and $r_x(l) = r_x^*(-l)$, we can write the normal equations (5.3.11) more
explicitly as
$$\sum_{k=0}^{M-1} h_o(k)\, r_x(m-k) = r_{yx}(m) \qquad 0 \le m \le M-1 \qquad (5.3.16)$$
which is the discrete-time counterpart of the Wiener-Hopf integral equation, and its solution determines the impulse
response of the optimum filter. We notice that the cross-correlation between the input signal and the desired response
(right-hand side) is equal to the convolution between the autocorrelation of the input signal and the optimum filter
(left-hand side). Thus, to obtain the optimum filter, we need to solve a convolution equation.
The MMSE is given by
$$P_o = P_y - \sum_{k=0}^{M-1} h_o(k)\, r_{yx}^*(k) \qquad (5.3.17)$$
which is obtained by substituting (5.3.14) into (5.3.12). Table 5.2 summarizes the information required for the design
of an optimum (in the MMSE sense) linear time-invariant filter, the Wiener-Hopf equations that define the filter, and
the resulting MMSE.


TABLE 5.2
Specification of optimum linear filters for stationary signals. The limits 0 and $M-1$ on the summations can be replaced by any
values $M_1$ and $M_2$.

Filter and error definitions    $e(n) \triangleq y(n) - \sum_{k=0}^{M-1} h(k)\, x(n-k)$
Criterion of performance        $P \triangleq E\{|e(n)|^2\} \to$ minimum
Wiener-Hopf equations           $\sum_{k=0}^{M-1} h_o(k)\, r_x(m-k) = r_{yx}(m)$, $0 \le m \le M-1$
Minimum MSE                     $P_o = P_y - \sum_{k=0}^{M-1} h_o(k)\, r_{yx}^*(k)$
Second-order statistics         $r_x(l) = E\{x(n)\, x^*(n-l)\}$, $P_y = E\{|y(n)|^2\}$, $r_{yx}(l) = E\{y(n)\, x^*(n-l)\}$

To summarize, for nonstationary processes $\mathbf{R}(n)$ is Hermitian and nonnegative definite, and the optimum filter $h_o(n)$ is time-varying.
For stationary processes, $\mathbf{R}$ is Hermitian, nonnegative definite, and Toeplitz, and the optimum filter is time-invariant. A Toeplitz
autocorrelation matrix is positive definite if the power spectrum of the input satisfies $R_x(e^{j\omega}) > 0$ for all frequencies $\omega$. In both cases,
the filter is used for all realizations of the processes. If $M = \infty$, we have a causal IIR optimum filter determined by an infinite-order
linear system of equations that can only be solved in the stationary case by using analytical techniques (see Section 5.6).

EXAMPLE 5.3.1 Consider a harmonic random process
$$y(n) = A \cos(\omega_0 n + \phi)$$

with fixed, but unknown, amplitude and frequency, and random phase $\phi$, uniformly distributed on the interval from 0 to $2\pi$.
This process is corrupted by additive white Gaussian noise $v(n) \sim N(0, \sigma_v^2)$ that is uncorrelated with $y(n)$. The resulting
signal $x(n) = y(n) + v(n)$ is available to the user for processing. Design an optimum FIR filter to remove the corrupting noise
$v(n)$ from the observed signal $x(n)$.
Solution. The input of the optimum filter is $x(n)$, and the desired response is $y(n)$. The signal $y(n)$ is obviously unavailable,
but to design the filter, we only need the second-order moments $r_x(l)$ and $r_{yx}(l)$. We first note that since $y(n)$ and $v(n)$ are
uncorrelated, the autocorrelation of the input signal is

$$r_x(l) = r_y(l) + r_v(l) = \tfrac{1}{2} A^2 \cos \omega_0 l + \sigma_v^2\, \delta(l)$$

where $r_y(l) = \tfrac{1}{2} A^2 \cos \omega_0 l$ is the autocorrelation of $y(n)$. The cross-correlation between the desired response $y(n)$ and the
input signal $x(n)$ is
$$r_{yx}(l) = E\{y(n)[y(n-l) + v(n-l)]\} = r_y(l)$$
Therefore, the autocorrelation matrix $\mathbf{R}$ is symmetric Toeplitz and is determined by the elements $r_x(0), r_x(1), \ldots, r_x(M-1)$ of its
first row. The right-hand side of the Wiener-Hopf equations is $\mathbf{d} = [r_y(0) \;\; r_y(1) \;\; \cdots \;\; r_y(M-1)]^T$. If we know $r_y(l)$ and $\sigma_v^2$, we
can numerically determine the optimum filter and the MMSE from (5.3.11) and (5.3.12). For example, suppose that
$A = 0.5$, $f_0 = \omega_0/(2\pi) = 0.05$, and $\sigma_v^2 = 0.5$. The input signal-to-noise ratio (SNR) is

$$\mathrm{SNR}_I = 10 \log \frac{A^2/2}{\sigma_v^2} = -6.02 \text{ dB}$$

The processing gain (PG), defined as the ratio of signal-to-noise ratios at the output and input of a signal processing system

$$\mathrm{PG} \triangleq \frac{\mathrm{SNR}_O}{\mathrm{SNR}_I}$$

provides another useful measure of performance.

The first problem we encounter is how to choose the order $M$ of the filter. In the absence of any a priori information, we
compute $\mathbf{h}_o$ and $P_o$ for $1 \le M \le M_{\max} = 50$, as well as the PG, and plot both results in Figure 5.12. We see that an $M = 20$ order
filter provides satisfactory performance. Figure 5.13 shows a realization of the corrupted and filtered signals. Another useful
approach to evaluate how well the optimum filter enhances a harmonic signal is to compute the spectra of the input and output
signals and the frequency response of the optimum filter. These are shown in Figure 5.14, where we see that the optimum filter has
a sharp bandpass about frequency $f_0$, as expected (for details see Problem 5.5).
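
The order-selection step of Example 5.3.1 can be sketched numerically. The following Python code (standard library only) builds the Toeplitz system (5.3.11) from the closed-form correlations of the example and evaluates the MMSE (5.3.12) for each order. The function names (r_y, r_x, solve, design) are hypothetical, the linear solver is plain Gaussian elimination rather than any routine from the text, and the processing gain is not computed here.

```python
import math

A, f0, var_v = 0.5, 0.05, 0.5          # values given in Example 5.3.1


def r_y(l):
    """Autocorrelation of the harmonic process: (A^2/2) cos(w0 l)."""
    return 0.5 * A * A * math.cos(2.0 * math.pi * f0 * l)


def r_x(l):
    """Autocorrelation of the noisy input: r_y(l) + var_v * delta(l)."""
    return r_y(l) + (var_v if l == 0 else 0.0)


def solve(mat, b):
    """Solve mat c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(mat, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[i][k] -= f * aug[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (aug[i][n] - s) / aug[i][i]
    return c


def design(M):
    """Return (h_o, P_o) of the order-M optimum FIR filter for this example."""
    R = [[r_x(abs(i - j)) for j in range(M)] for i in range(M)]   # Toeplitz R
    d = [r_y(k) for k in range(M)]        # r_yx(l) = r_y(l) in this example
    h = solve(R, d)                       # normal equations (5.3.11)
    P_o = r_y(0) - sum(hk * dk for hk, dk in zip(h, d))           # (5.3.12)
    return h, P_o


snr_in_db = 10.0 * math.log10(r_y(0) / var_v)        # input SNR in dB
mmse = {M: design(M)[1] for M in range(1, 51)}        # MMSE versus filter order
```

Printing or plotting the values in mmse against M gives the kind of curve described for Figure 5.12(a); since the data are real, the impulse response and the coefficient vector coincide here.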
FIGURE 5.12
Plots of (a) the MMSE and (b) the processing gain as a function of the filter order $M$.

To illustrate the meaning of the estimator’s optimality, we will use a Monte Carlo simulation. Thus, we generate $K = 100$
realizations of the sequence $x(\zeta_i, n)$, $0 \le n \le N-1$ ($N = 1000$); we compute the output sequence $\hat{y}(\zeta_i, n)$, using (5.3.15); and
then the error sequence $e(\zeta_i, n) = y(\zeta_i, n) - \hat{y}(\zeta_i, n)$ and its variance $\hat{P}(\zeta_i)$. Figure 5.15 shows a plot of
$\hat{P}(\zeta_i)$, $1 \le i \le K$. We notice that although the filter performs better or worse than the optimum in particular cases, on average
its performance is close to the theoretically predicted one. This is exactly the meaning of the MMSE criterion: optimum
performance on the average (in the MMSE sense).

FIGURE 5.13
Example of the noise-corrupted and filtered sinusoidal signals.

FIGURE 5.14
PSD of the input signal, magnitude response of the optimum filter, and PSD of the output signal.

FIGURE 5.15
Results of Monte Carlo simulation of the optimum filter.

For a certain realization, the optimum filter may not perform as well as some other linear filters; however, on
average, it performs better than any other linear filter of the same order when all possible realizations of $x(n)$ and
$y(n)$ are considered.


5.3.3 Frequency-Domain Interpretations 


We will now investigate the performance of the optimum filter, for stationary processes, in the frequency domain. 
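Using (5.2.7), (5.3.13), and (5.3.14), we can easily show that the MSE of an FIR filter $h(n)$ is given by

$$P = E\{|e(n)|^2\} = r_y(0) - \sum_{k=0}^{M-1} h(k)\, r_{yx}^*(k) - \sum_{k=0}^{M-1} h^*(k)\, r_{yx}(k) + \sum_{k=0}^{M-1} \sum_{l=0}^{M-1} h(k)\, r_x(l-k)\, h^*(l) \qquad (5.3.18)$$
The frequency response function of the FIR filter is
$$H(e^{j\omega}) \triangleq \sum_{k=0}^{M-1} h(k)\, e^{-j\omega k} \qquad (5.3.19)$$
Using Parseval’s theorem,
$$\sum_{n=-\infty}^{\infty} x_1(n)\, x_2^*(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X_1(e^{j\omega})\, X_2^*(e^{j\omega})\, d\omega \qquad (5.3.20)$$

we can show that the MSE (5.3.18) can be expressed in the frequency domain as
$$P = r_y(0) - \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ H(e^{j\omega})\, R_{yx}^*(e^{j\omega}) + H^*(e^{j\omega})\, R_{yx}(e^{j\omega}) - H(e^{j\omega})\, H^*(e^{j\omega})\, R_x(e^{j\omega}) \right] d\omega \qquad (5.3.21)$$

where $R_x(e^{j\omega})$ is the PSD of $x(n)$ and $R_{yx}(e^{j\omega})$ is the cross-PSD of $y(n)$ and $x(n)$ (see Problem 5.10).
This formula holds for both FIR and IIR filters.

If we minimize (5.3.21) with respect to $H(e^{j\omega})$, we obtain the system function of the optimum filter and the
MMSE. However, we leave this for Problem 5.11 and instead express (5.3.17) in the frequency domain by using
(5.3.20). Indeed, we have

$$\begin{aligned}
P_o &= r_y(0) - \frac{1}{2\pi} \int_{-\pi}^{\pi} H_o(e^{j\omega})\, R_{yx}^*(e^{j\omega})\, d\omega \\
    &= \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ R_y(e^{j\omega}) - H_o(e^{j\omega})\, R_{yx}^*(e^{j\omega}) \right] d\omega
\end{aligned} \qquad (5.3.22)$$

where $H_o(e^{j\omega})$ is the frequency response of the optimum filter. The above equation holds for any filter, FIR or IIR,
as long as we use the proper limits to compute the summation in (5.3.19).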

We will now obtain a formula for the MMSE that holds only for IIR filters whose impulse response extends
from $-\infty$ to $\infty$. In this case, (5.3.16) is a convolution equation that holds for $-\infty < m < \infty$. Using the convolution
theorem of the Fourier transform, we obtain

$$H_o(e^{j\omega}) = \frac{R_{yx}(e^{j\omega})}{R_x(e^{j\omega})} \qquad (5.3.23)$$

which, we again stress, holds for noncausal IIR filters only. Substituting into (5.3.22), we obtain

$$P_o = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 1 - \frac{|R_{yx}(e^{j\omega})|^2}{R_x(e^{j\omega})\, R_y(e^{j\omega})} \right] R_y(e^{j\omega})\, d\omega$$

or
$$P_o = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ 1 - \mathcal{G}_{xy}(e^{j\omega}) \right] R_y(e^{j\omega})\, d\omega \qquad (5.3.24)$$

where $\mathcal{G}_{xy}(e^{j\omega})$ is the coherence function between $x(n)$ and $y(n)$.

This important equation indicates that the performance of the optimum filter depends on the coherence between
the input and desired response processes. As we recall from Section 4.4, the coherence is a measure of both the noise
disturbing the observations and the relative linearity between $x(n)$ and $y(n)$. The optimum filter can reduce the
MMSE at a certain band only if there is significant coherence, that is, $\mathcal{G}_{xy}(e^{j\omega}) \simeq 1$. Thus, the optimum filter
$H_o(z)$ constitutes the best, in the MMSE sense, linear relationship between the stochastic processes $x(n)$ and
$y(n)$. These interpretations apply to causal IIR and FIR optimum filters, even if (5.3.23) and (5.3.24) only hold
approximately in these cases (see Section 5.6).
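
As a worked special case, chosen here for illustration because it matches the setup of Example 5.3.1, suppose that $x(n) = y(n) + v(n)$ with $v(n)$ uncorrelated with $y(n)$ (the noise PSD $R_v(e^{j\omega})$ is notation introduced for this sketch). Then $R_{yx} = R_y$ and $R_x = R_y + R_v$, so the noncausal optimum filter, the coherence, and the MMSE reduce to

```latex
H_o(e^{j\omega}) = \frac{R_y(e^{j\omega})}{R_y(e^{j\omega}) + R_v(e^{j\omega})}, \qquad
\mathcal{G}_{xy}(e^{j\omega}) = \frac{|R_{yx}(e^{j\omega})|^2}{R_x(e^{j\omega})\, R_y(e^{j\omega})}
                             = \frac{R_y(e^{j\omega})}{R_y(e^{j\omega}) + R_v(e^{j\omega})}

P_o = \frac{1}{2\pi} \int_{-\pi}^{\pi} \bigl[ 1 - \mathcal{G}_{xy}(e^{j\omega}) \bigr] R_y(e^{j\omega})\, d\omega
    = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{R_y(e^{j\omega})\, R_v(e^{j\omega})}{R_y(e^{j\omega}) + R_v(e^{j\omega})}\, d\omega
```

In bands where the signal PSD dominates the noise PSD the coherence is close to 1, the filter passes the input almost unchanged, and the contribution to the MMSE is small; where the noise dominates, the coherence and the filter gain both approach 0.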


5.4 Linear Prediction 


Linear prediction plays a prominent role in many theoretical, computational, and practical areas of signal processing 
and deals with the problem of estimating or predicting the value $x(n)$ of a signal at the time instant $n = n_0$, by
using a set of other samples from the same signal. Although linear prediction is a subject useful in itself, its 
importance in signal processing is also due, as we will see later, to its use in the development of fast algorithms for 
optimum filtering and its relation to all-pole signal modeling. 


5.4.1 Linear Signal Estimation 


Suppose that we are given a set of values $x(n), x(n-1), \ldots, x(n-M)$ of a stochastic process and we wish to
estimate the value of $x(n-i)$, using a linear combination of the remaining samples. The resulting estimate and the
corresponding estimation error are given by

$$\hat{x}(n-i) \triangleq -\sum_{\substack{k=0 \\ k \neq i}}^{M} c_k^*(n)\, x(n-k) \qquad (5.4.1)$$

and
$$e^{(i)}(n) \triangleq x(n-i) - \hat{x}(n-i) = \sum_{k=0}^{M} c_k^*(n)\, x(n-k) \quad \text{with } c_i(n) \triangleq 1 \qquad (5.4.2)$$
where $c_k(n)$ are the coefficients of the estimator as a function of discrete-time index $n$. The process is illustrated in
Figure 5.16.

FIGURE 5.16
Illustration showing the samples, estimates, and errors used in linear signal estimation, forward linear prediction, and backward linear
prediction.


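To determine the MMSE signal estimator, we partition (5.4.2) as

$$\begin{aligned}
e^{(i)}(n) &= \sum_{k=0}^{i-1} c_k^*(n)\, x(n-k) + x(n-i) + \sum_{k=i+1}^{M} c_k^*(n)\, x(n-k) \\
           &\triangleq \mathbf{c}_1^H(n)\, \mathbf{x}_1(n) + x(n-i) + \mathbf{c}_2^H(n)\, \mathbf{x}_2(n) \\
           &\triangleq [\bar{\mathbf{c}}^{(i)}(n)]^H\, \bar{\mathbf{x}}(n)
\end{aligned} \qquad (5.4.3)$$
where the partitions of the coefficient and data vectors, around the ith component, are easily defined from the
context. To obtain the normal equations and the MMSE for the optimum linear signal estimator, we note that

$$\text{Desired response} = x(n-i) \qquad \text{data vector} = \begin{bmatrix} \mathbf{x}_1(n) \\ \mathbf{x}_2(n) \end{bmatrix}$$

Using (5.3.6) and (5.3.9) or the orthogonality principle, we have

$$\begin{bmatrix} \mathbf{R}_{11}(n) & \mathbf{R}_{12}(n) \\ \mathbf{R}_{12}^H(n) & \mathbf{R}_{22}(n) \end{bmatrix}
\begin{bmatrix} \mathbf{c}_1(n) \\ \mathbf{c}_2(n) \end{bmatrix}
= -\begin{bmatrix} \mathbf{r}_1(n) \\ \mathbf{r}_2(n) \end{bmatrix} \qquad (5.4.4)$$
or more compactly⁴
$$\mathbf{R}^{(i)}(n)\, \mathbf{c}_o^{(i)}(n) = -\mathbf{d}^{(i)}(n) \qquad (5.4.5)$$
and
$$P_o^{(i)}(n) = P_x(n-i) + \mathbf{r}_1^H(n)\, \mathbf{c}_1(n) + \mathbf{r}_2^H(n)\, \mathbf{c}_2(n) \qquad (5.4.6)$$
where for $j, k = 1, 2$
$$\mathbf{R}_{jk}(n) \triangleq E\{\mathbf{x}_j(n)\, \mathbf{x}_k^H(n)\} \qquad (5.4.7)$$
$$\mathbf{r}_j(n) \triangleq E\{\mathbf{x}_j(n)\, x^*(n-i)\} \qquad (5.4.8)$$
$$P_x(n) = E\{|x(n)|^2\} \qquad (5.4.9)$$

For various reasons, to be seen later, we will combine (5.4.4) and (5.4.6) into a single equation. To this end, we note
that the correlation matrix of the extended vector

$$\bar{\mathbf{x}}(n) = \begin{bmatrix} \mathbf{x}_1(n) \\ x(n-i) \\ \mathbf{x}_2(n) \end{bmatrix} \qquad (5.4.10)$$

can be partitioned as

$$\bar{\mathbf{R}}(n) = E\{\bar{\mathbf{x}}(n)\, \bar{\mathbf{x}}^H(n)\} = \begin{bmatrix}
\mathbf{R}_{11}(n) & \mathbf{r}_1(n) & \mathbf{R}_{12}(n) \\
\mathbf{r}_1^H(n) & P_x(n-i) & \mathbf{r}_2^H(n) \\
\mathbf{R}_{12}^H(n) & \mathbf{r}_2(n) & \mathbf{R}_{22}(n)
\end{bmatrix} \qquad (5.4.11)$$

with respect to its ith row and ith column. Using (5.4.4), (5.4.6), and (5.4.11), we obtain

$$\bar{\mathbf{R}}(n)\, \bar{\mathbf{c}}_o^{(i)}(n) = \begin{bmatrix} \mathbf{0} \\ P_o^{(i)}(n) \\ \mathbf{0} \end{bmatrix} \leftarrow i\text{th row} \qquad (5.4.12)$$
which completely determines the linear signal estimator $\bar{\mathbf{c}}_o^{(i)}(n)$ and the MMSE $P_o^{(i)}(n)$.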


If $M = 2L$ and $i = L$, we have a symmetric linear smoother $\bar{\mathbf{c}}(n)$ that produces an estimate of the middle
sample by using the $L$ past and the $L$ future samples. The above formulation suggests an easy procedure for the
computation of the linear signal estimator for any value of $i$, which is outlined in Table 5.3 and implemented by the
function [ci,Pi]=lsigest(R,i). We next discuss two types of linear signal estimation that are of special
interest and have their own dedicated notation.

⁴ The minus sign on the right-hand side of the normal equations is the result of arbitrarily setting the coefficient $c_i(n) \triangleq 1$.


TABLE 5.3
Steps for the computation of optimum signal estimators.

1. Determine the matrix $\bar{\mathbf{R}}(n)$ of the extended data vector $\bar{\mathbf{x}}(n)$.
2. Create the $M \times M$ submatrix $\mathbf{R}^{(i)}(n)$ of $\bar{\mathbf{R}}(n)$ by removing its ith row and its ith column.
3. Create the $M \times 1$ vector $\mathbf{d}^{(i)}(n)$ by extracting the ith column $\bar{\mathbf{d}}^{(i)}(n)$ of $\bar{\mathbf{R}}(n)$ and removing its ith element.
4. Solve the linear system $\mathbf{R}^{(i)}(n)\, \mathbf{c}_o^{(i)}(n) = -\mathbf{d}^{(i)}(n)$ to obtain $\mathbf{c}_o^{(i)}(n)$.
5. Compute the MMSE $P_o^{(i)}(n) = [\bar{\mathbf{d}}^{(i)}(n)]^H\, \bar{\mathbf{c}}_o^{(i)}(n)$.
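
The steps of Table 5.3 can be sketched in code. The following Python function (standard library only) is an illustration written for this purpose and is not the lsigest routine mentioned above: its name (lsigest_sketch), the zero-based convention for the index i, and the choice to return the extended coefficient vector including the unit entry are all assumptions. The test matrix is an assumed real Toeplitz example with r(l) = 0.5^|l| and M = 2.

```python
def solve(mat, b):
    """Solve mat c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(mat, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[i][k] -= f * aug[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (aug[i][n] - s) / aug[i][i]
    return c


def lsigest_sketch(Rbar, i):
    """Follow Table 5.3 for an (M+1)x(M+1) matrix Rbar and index 0 <= i <= M."""
    n = len(Rbar)
    keep = [k for k in range(n) if k != i]
    Ri = [[Rbar[r][c] for c in keep] for r in keep]    # step 2: drop row/column i
    dbar = [Rbar[r][i] for r in range(n)]              # ith column of Rbar
    di = [dbar[r] for r in keep]                       # step 3: drop its ith element
    ci = solve(Ri, [-v for v in di])                   # step 4: R c = -d
    cbar = ci[:i] + [1.0] + ci[i:]                     # re-insert the unit coefficient
    Pi = sum(dv.conjugate() * cv for dv, cv in zip(dbar, cbar))   # step 5
    return cbar, Pi.real


# Assumed example: stationary process with r(l) = 0.5**|l| and M = 2.
Rbar = [[0.5 ** abs(j - k) for k in range(3)] for j in range(3)]
c_flp, P_flp = lsigest_sketch(Rbar, 0)          # i = 0: forward linear prediction
c_smooth, P_smooth = lsigest_sketch(Rbar, 1)    # i = 1: symmetric smoother (L = 1)
```

For this correlation sequence the smoother, which uses one sample on each side, attains a smaller MMSE than the one-sided predictor of the same order.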


5.4.2 Forward Linear Prediction 


One-step forward linear prediction (FLP) involves the estimation or prediction of the value $x(n)$ of a stochastic
process by using a linear combination of the past samples $x(n-1), \ldots, x(n-M)$ (see Figure 5.16). We should
stress that in signal processing applications of linear prediction, what is important is the ability to obtain a good
estimate of a sample, pretending that it is unknown, instead of forecasting the future. Thus, the term prediction is used
more with the signal estimation than forecasting in mind. The forward predictor is a linear signal estimator with
$i = 0$ and is denoted by

$$e^f(n) \triangleq x(n) + \sum_{k=1}^{M} a_k^*(n)\, x(n-k) = x(n) + \mathbf{a}^H(n)\, \mathbf{x}(n-1) \qquad (5.4.13)$$

where
$$\mathbf{a}(n) \triangleq [a_1(n) \;\; a_2(n) \;\; \cdots \;\; a_M(n)]^T \qquad (5.4.14)$$

is known as the forward linear predictor and $a_k(n)$ with $a_0(n) \triangleq 1$ as the FLP error filter. To obtain the normal
equations and the MMSE for the optimum FLP, we note that for $i = 0$, (5.4.11) can be written as

$$\bar{\mathbf{R}}(n) = \begin{bmatrix} P_x(n) & \mathbf{r}^{fH}(n) \\ \mathbf{r}^f(n) & \mathbf{R}(n-1) \end{bmatrix} \qquad (5.4.15)$$

where
$$\mathbf{R}(n) \triangleq E\{\mathbf{x}(n)\, \mathbf{x}^H(n)\} \qquad (5.4.16)$$
and
$$\mathbf{r}^f(n) \triangleq E\{\mathbf{x}(n-1)\, x^*(n)\} \qquad (5.4.17)$$
Therefore, (5.4.5) and (5.4.6) give
$$\mathbf{R}(n-1)\, \mathbf{a}_o(n) = -\mathbf{r}^f(n) \qquad (5.4.18)$$
and
$$P_o^f(n) = P_x(n) + \mathbf{r}^{fH}(n)\, \mathbf{a}_o(n) \qquad (5.4.19)$$
or
$$\bar{\mathbf{R}}(n) \begin{bmatrix} 1 \\ \mathbf{a}_o(n) \end{bmatrix} = \begin{bmatrix} P_o^f(n) \\ \mathbf{0} \end{bmatrix} \qquad (5.4.20)$$

which completely specifies the FLP parameters.


5.4.3 Backward Linear Prediction


In this case, we want to estimate the sample $x(n-M)$ in terms of the future samples $x(n), x(n-1), \ldots, x(n-M+1)$
(see Figure 5.16). The term backward linear prediction (BLP) is not accurate but is used since it is an established
convention. A more appropriate name might be postdiction or hindsight. The BLP is basically a linear signal
estimator with $i = M$ and is denoted by

$$e^b(n) \triangleq \sum_{k=0}^{M-1} b_k^*(n)\, x(n-k) + x(n-M) = \mathbf{b}^H(n)\, \mathbf{x}(n) + x(n-M) \qquad (5.4.21)$$

where
$$\mathbf{b}(n) \triangleq [b_0(n) \;\; b_1(n) \;\; \cdots \;\; b_{M-1}(n)]^T \qquad (5.4.22)$$

is the BLP and $b_k(n)$ with $b_M(n) \triangleq 1$ is the backward prediction error filter (BPEF). For $i = M$, (5.4.11) gives

$$\bar{\mathbf{R}}(n) = \begin{bmatrix} \mathbf{R}(n) & \mathbf{r}^b(n) \\ \mathbf{r}^{bH}(n) & P_x(n-M) \end{bmatrix} \qquad (5.4.23)$$
where
$$\mathbf{r}^b(n) \triangleq E\{\mathbf{x}(n)\, x^*(n-M)\} \qquad (5.4.24)$$
The optimum backward linear predictor is specified by
$$\mathbf{R}(n)\, \mathbf{b}_o(n) = -\mathbf{r}^b(n) \qquad (5.4.25)$$
and the MMSE is
$$P_o^b(n) = P_x(n-M) + \mathbf{r}^{bH}(n)\, \mathbf{b}_o(n) \qquad (5.4.26)$$
and can be put in a single equation as
$$\bar{\mathbf{R}}(n) \begin{bmatrix} \mathbf{b}_o(n) \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ P_o^b(n) \end{bmatrix} \qquad (5.4.27)$$

In Table 5.4, we summarize the definitions and design equations for optimum FIR filtering and prediction. Using the
entries in this table, we can easily obtain the normal equations and the MMSE for the FLP and BLP from those of the
optimum filter.


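TABLE 5.4 
Summary of the design equations for optimum FIR filtering and prediction. 

| Quantity | Optimum filter | FLP | BLP |
|---|---|---|---|
| Input data vector | $\mathbf{x}(n)$ | $\mathbf{x}(n-1)$ | $\mathbf{x}(n)$ |
| Desired response | $y(n)$ | $x(n)$ | $x(n-M)$ |
| Coefficient vector | $\mathbf{h}(n)$ | $\mathbf{a}(n)$ | $\mathbf{b}(n)$ |
| Estimation error | $e(n) = y(n) - \mathbf{c}^H(n)\mathbf{x}(n)$ | $e^f(n) = x(n) + \mathbf{a}^H(n)\mathbf{x}(n-1)$ | $e^b(n) = x(n-M) + \mathbf{b}^H(n)\mathbf{x}(n)$ |
| Normal equations | $\mathbf{R}(n)\mathbf{h}_o(n) = \mathbf{d}(n)$ | $\mathbf{R}(n-1)\mathbf{a}_o(n) = -\mathbf{r}^f(n)$ | $\mathbf{R}(n)\mathbf{b}_o(n) = -\mathbf{r}^b(n)$ |
| MMSE | $P_o(n) = P_y(n) - \mathbf{c}_o^H(n)\mathbf{d}(n)$ | $P_o^f(n) = P_x(n) + \mathbf{a}_o^H(n)\mathbf{r}^f(n)$ | $P_o^b(n) = P_x(n-M) + \mathbf{b}_o^H(n)\mathbf{r}^b(n)$ |
| Required moments | $\mathbf{R}(n) = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$, $\mathbf{d}(n) = E\{\mathbf{x}(n)y^*(n)\}$ | $\mathbf{r}^f(n) = E\{\mathbf{x}(n-1)x^*(n)\}$ | $\mathbf{r}^b(n) = E\{\mathbf{x}(n)x^*(n-M)\}$ |
| Stationary processes | $\mathbf{R}\mathbf{c}_o = \mathbf{d}$, $\mathbf{R}$ is Toeplitz | $\mathbf{R}\mathbf{a}_o = -\mathbf{r}^*$ | $\mathbf{R}\mathbf{b}_o = -\mathbf{J}\mathbf{r} \Rightarrow \mathbf{b}_o = \mathbf{J}\mathbf{a}_o^*$ |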


5.4.4 Stationary Processes 


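If the process x(n) is stationary, then the correlation matrix $\bar{\mathbf{R}}(n)$ does not depend on the time n and it is 
Toeplitz 

$$ \bar{\mathbf{R}} = \begin{bmatrix} r(0) & r(1) & \cdots & r(M) \\ r^*(1) & r(0) & \cdots & r(M-1) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M) & r^*(M-1) & \cdots & r(0) \end{bmatrix} \tag{5.4.28} $$

Therefore, all the resulting linear MMSE signal estimators are time-invariant. If we define the correlation vector 

$$ \mathbf{r} \triangleq [r(1)\;\; r(2)\;\; \cdots\;\; r(M)]^T \tag{5.4.29} $$

where $r(l) = E\{x(n)x^*(n-l)\}$, we can easily see that the cross-correlation vectors for the FLP and the BLP are 

$$ \mathbf{r}^f = E\{\mathbf{x}(n-1)\,x^*(n)\} = \mathbf{r}^* \tag{5.4.30} $$

and

$$ \mathbf{r}^b = E\{\mathbf{x}(n)\,x^*(n-M)\} = \mathbf{J}\mathbf{r} \tag{5.4.31} $$

where

$$ \mathbf{J} \triangleq \begin{bmatrix} 0 & 0 & \cdots & 1 \\ \vdots & \vdots & ⋰ & \vdots \\ 0 & 1 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \end{bmatrix}, \qquad \mathbf{J}^H\mathbf{J} = \mathbf{J}\mathbf{J}^H = \mathbf{I} \tag{5.4.32} $$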

is the exchange matrix that simply reverses the order of the vector elements. Therefore, 


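$$ \mathbf{R}\mathbf{a}_o = -\mathbf{r}^* \tag{5.4.33} $$

$$ P_o^f = r(0) + \mathbf{r}^T\mathbf{a}_o \tag{5.4.34} $$

$$ \mathbf{R}\mathbf{b}_o = -\mathbf{J}\mathbf{r} \tag{5.4.35} $$

$$ P_o^b = r(0) + \mathbf{r}^H\mathbf{J}\mathbf{b}_o \tag{5.4.36} $$

where the Toeplitz matrix $\mathbf{R}$ is obtained from $\bar{\mathbf{R}}$ by deleting the last column and row. Using the centrosymmetry 
property of symmetric Toeplitz matrices 

$$ \mathbf{R}\mathbf{J} = \mathbf{J}\mathbf{R}^* \tag{5.4.37} $$

and (5.4.33), we have 

$$ \mathbf{J}\mathbf{R}^*\mathbf{a}_o^* = -\mathbf{J}\mathbf{r} \quad \text{or} \quad \mathbf{R}\mathbf{J}\mathbf{a}_o^* = -\mathbf{J}\mathbf{r} \tag{5.4.38} $$

Comparing the last equation with (5.4.35), we have 

$$ \mathbf{b}_o = \mathbf{J}\mathbf{a}_o^* \tag{5.4.39} $$

that is, the BLP coefficient vector is the reverse of the conjugated FLP coefficient vector. Furthermore, from (5.4.34), 
(5.4.36), and (5.4.39), we have 

$$ P_o \triangleq P_o^f = P_o^b \tag{5.4.40} $$

that is, the forward and backward prediction error powers are equal. 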
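
The symmetry in (5.4.39) and (5.4.40) is easy to check numerically. The following is a minimal sketch, not part of the original derivation: the autocorrelation values and the helper name `hermitian_toeplitz` are assumptions chosen only for illustration (a sum of two decaying complex exponentials, which is a valid autocorrelation sequence).

```python
import numpy as np

def hermitian_toeplitz(r):
    """Hermitian Toeplitz matrix whose first row is r(0), r(1), ..., as in (5.4.28)."""
    n = len(r)
    R = np.empty((n, n), dtype=complex)
    for i in range(n):
        for j in range(n):
            R[i, j] = r[j - i] if j >= i else np.conj(r[i - j])
    return R

M = 4
lags = np.arange(M + 1)
# Hypothetical autocorrelation r(0), ..., r(M) of a complex stationary process.
r_full = (0.6 * np.exp(0.7j)) ** lags + 0.5 * (0.8 * np.exp(-1.9j)) ** lags

R = hermitian_toeplitz(r_full[:M])   # M x M matrix R
r = r_full[1:]                       # r = [r(1), ..., r(M)]^T

a = np.linalg.solve(R, -np.conj(r))  # (5.4.33): R a_o = -r^*
b = np.linalg.solve(R, -r[::-1])     # (5.4.35): R b_o = -J r

P_f = (r_full[0] + r @ a).real                  # (5.4.34): r(0) + r^T a_o
P_b = (r_full[0] + np.conj(r[::-1]) @ b).real   # (5.4.36): r(0) + r^H J b_o

print(np.allclose(b, np.conj(a[::-1])))  # checks b_o = J a_o^*
print(np.isclose(P_f, P_b))              # checks P_o^f = P_o^b
```

Reversing an array with `[::-1]` plays the role of multiplication by the exchange matrix J.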

This remarkable symmetry between the MMSE forward and backward linear predictors holds for stationary 
processes but disappears for nonstationary processes. Also, we do not have such a symmetry if a criterion other than 
the MMSE is used and the process to be predicted is non-Gaussian (Weiss 1975; Lawrence 1991). 

EXAMPLE 5.4.1. To illustrate the basic ideas in FLP, BLP, and linear smoothing, we consider the second-order 
estimators for stationary processes. 
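The augmented equations for the first-order FLP are 

$$ \begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1^{(1)} \end{bmatrix} = \begin{bmatrix} P_1^f \\ 0 \end{bmatrix} $$

and they can be solved by using Cramer's rule. Indeed, we obtain 

$$ 1 = \frac{\det\begin{bmatrix} P_1^f & r(1) \\ 0 & r(0) \end{bmatrix}}{\det \mathbf{R}_2} = \frac{r(0)P_1^f}{\det \mathbf{R}_2} \;\Rightarrow\; P_1^f = \frac{\det \mathbf{R}_2}{\det \mathbf{R}_1} = \frac{r^2(0) - |r(1)|^2}{r(0)} $$

and 

$$ a_1^{(1)} = \frac{\det\begin{bmatrix} r(0) & P_1^f \\ r^*(1) & 0 \end{bmatrix}}{\det \mathbf{R}_2} = -\frac{P_1^f\, r^*(1)}{\det \mathbf{R}_2} = -\frac{r^*(1)}{r(0)} $$

for the MMSE and the FLP. For the second-order case we have 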
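$$ \begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1^{(2)} \\ a_2^{(2)} \end{bmatrix} = \begin{bmatrix} P_2^f \\ 0 \\ 0 \end{bmatrix} $$

whose solution is 

$$ 1 = \frac{P_2^f \det \mathbf{R}_2}{\det \mathbf{R}_3} \;\Rightarrow\; P_2^f = \frac{\det \mathbf{R}_3}{\det \mathbf{R}_2} $$

$$ a_1^{(2)} = -\frac{P_2^f \det\begin{bmatrix} r^*(1) & r(1) \\ r^*(2) & r(0) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{r(1)r^*(2) - r(0)r^*(1)}{r^2(0) - |r(1)|^2} $$

and 

$$ a_2^{(2)} = \frac{P_2^f \det\begin{bmatrix} r^*(1) & r(0) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{[r^*(1)]^2 - r(0)r^*(2)}{r^2(0) - |r(1)|^2} $$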


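Similarly, for the BLP 

$$ \begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} b_0^{(1)} \\ b_1^{(1)} \end{bmatrix} = \begin{bmatrix} 0 \\ P_1^b \end{bmatrix} $$

where $b_1^{(1)} = 1$, we obtain 

$$ P_1^b = \frac{\det \mathbf{R}_2}{\det \mathbf{R}_1} \quad \text{and} \quad b_0^{(1)} = -\frac{r(1)}{r(0)} $$

and for 

$$ \begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} b_0^{(2)} \\ b_1^{(2)} \\ b_2^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ P_2^b \end{bmatrix} $$

with $b_2^{(2)} = 1$, we obtain 

$$ P_2^b = \frac{\det \mathbf{R}_3}{\det \mathbf{R}_2}, \qquad b_1^{(2)} = \frac{r^*(1)r(2) - r(0)r(1)}{r^2(0) - |r(1)|^2}, \qquad b_0^{(2)} = \frac{r^2(1) - r(0)r(2)}{r^2(0) - |r(1)|^2} $$

We note that 

$$ P_1^f = P_1^b, \qquad a_1^{(1)} = b_0^{(1)*} $$

and 

$$ P_2^f = P_2^b, \qquad a_1^{(2)} = b_1^{(2)*}, \qquad a_2^{(2)} = b_0^{(2)*} $$

which is a result of the stationarity of x(n) or equivalently of the Toeplitz structure of $\mathbf{R}_m$. 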
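For the linear signal estimator, we have 

$$ \begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix} \begin{bmatrix} c_0^{(2)} \\ c_1^{(2)} \\ c_2^{(2)} \end{bmatrix} = \begin{bmatrix} 0 \\ P_2 \\ 0 \end{bmatrix} $$

with $c_1^{(2)} = 1$. Using Cramer's rule, we obtain 

$$ P_2 = \frac{\det \mathbf{R}_3}{\det \mathbf{R}_2^{(2)}}, \qquad \mathbf{R}_2^{(2)} \triangleq \begin{bmatrix} r(0) & r(2) \\ r^*(2) & r(0) \end{bmatrix} $$

$$ c_0^{(2)} = -\frac{P_2 \det\begin{bmatrix} r(1) & r(2) \\ r^*(1) & r(0) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{r^*(1)r(2) - r(0)r(1)}{r^2(0) - |r(2)|^2} $$

$$ c_2^{(2)} = -\frac{P_2 \det\begin{bmatrix} r(0) & r(1) \\ r^*(2) & r^*(1) \end{bmatrix}}{\det \mathbf{R}_3} = \frac{r(1)r^*(2) - r(0)r^*(1)}{r^2(0) - |r(2)|^2} $$

from which we see that $c_0^{(2)} = c_2^{(2)*}$; that is, we have a linear phase estimator. 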


5.4.5 Properties 


Linear signal estimators and predictors have some interesting properties that we discuss next. 
PROPERTY 5.4.1. If the process x(n) is stationary, then the symmetric, linear smoother has linear 
phase. 
Proof. Using the centrosymmetry property $\mathbf{R}\mathbf{J} = \mathbf{J}\mathbf{R}^*$ and (5.4.12) for M = 2L, i = L, we obtain 

$$ \bar{\mathbf{c}} = \mathbf{J}\bar{\mathbf{c}}^* \tag{5.4.41} $$

that is, the symmetric, linear smoother has even symmetry and, therefore, has linear phase (see Problem 5.12). 
PROPERTY 5.4.2. If the process x(n) is stationary, the forward prediction error filter (PEF) 
$1, a_1, a_2, \ldots, a_M$ is minimum-phase and the backward PEF $b_0, b_1, \ldots, b_{M-1}, 1$ is maximum-phase. 


[Figure: x(n) enters G(z), whose output s(n) enters $1 - qz^{-1}$, which produces $e^f(n)$.]

FIGURE 5.17 
The prediction error filter with one zero factored out. 


Proof. The system function of the Mth-order forward PEF can be factored as 

$$ A(z) = 1 + \sum_{k=1}^{M} a_k^* z^{-k} = G(z)(1 - qz^{-1}) $$

where q is a zero of A(z) and 

$$ G(z) = 1 + \sum_{k=1}^{M-1} g_k z^{-k} $$

is an (M-1)st-order filter. The filter A(z) can be implemented as the cascade connection of the filters G(z) and $1 - qz^{-1}$ 
(see Figure 5.17). The output s(n) of G(z) is 

$$ s(n) = x(n) + g_1 x(n-1) + \cdots + g_{M-1}\,x(n-M+1) $$

and it is easy to see that 

$$ E\{s(n-1)\,e^{f*}(n)\} = 0 \tag{5.4.42} $$

because $E\{x(n-k)\,e^{f*}(n)\} = 0$ for $1 \le k \le M$. Since the output of the second filter can be expressed as 

$$ e^f(n) = s(n) - q\,s(n-1) $$

we have 

$$ E\{s(n-1)\,e^{f*}(n)\} = E\{s(n-1)s^*(n)\} - q^* E\{s(n-1)s^*(n-1)\} = 0 $$

which implies that 

$$ q = \frac{r_s(1)}{r_s(0)} \;\Rightarrow\; |q| \le 1 $$

because q is equal to the normalized autocorrelation of s(n). If the process x(n) is not predictable, that is, $E\{|e^f(n)|^2\} \ne 0$, 
we have 

$$ \begin{aligned} E\{|e^f(n)|^2\} &= E\{e^f(n)[s^*(n) - q^* s^*(n-1)]\} \\ &= E\{e^f(n)s^*(n)\} \qquad \text{due to (5.4.42)} \\ &= E\{[s(n) - q\,s(n-1)]\,s^*(n)\} \\ &= r_s(0)(1 - |q|^2) \ne 0 \end{aligned} $$

which implies that 

$$ |q| < 1 $$

that is, the zero q of the forward PEF is strictly inside the unit circle. Repeating this process, we can show that all zeros of A(z) 
are inside the unit circle; that is, A(z) is minimum-phase. This proof was presented in Vaidyanathan et al. (1996). The property 
$\mathbf{b} = \mathbf{J}\mathbf{a}^*$ is equivalent to 

$$ B(z) = z^{-M} A^*\!\left(\frac{1}{z^*}\right) $$

which implies that B(z) is a maximum-phase filter. 
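
Property 5.4.2 can be illustrated by computing the zeros of A(z) and B(z) for a specific correlation sequence. The sketch below assumes a hypothetical real autocorrelation (a damped cosine plus a small white-noise term, chosen only so that the correlation matrix is positive definite); for a real process the backward predictor is simply the reversed forward predictor.

```python
import numpy as np

M = 6
lags = np.arange(M + 1)
# Hypothetical real autocorrelation r(0), ..., r(M): damped cosine plus white noise.
r = 0.9 ** lags * np.cos(0.5 * lags)
r[0] += 0.1

# Symmetric Toeplitz matrix R (M x M) built from r(|i - j|).
idx = np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
R = r[idx]

a = np.linalg.solve(R, -r[1:])   # R a_o = -r (real case)
b = a[::-1]                      # b_o = J a_o (real case)

# Zeros of A(z) = 1 + a_1 z^-1 + ... + a_M z^-M
zeros_fwd = np.roots(np.concatenate(([1.0], a)))
# Zeros of B(z) = b_0 + b_1 z^-1 + ... + b_{M-1} z^-(M-1) + z^-M
zeros_bwd = np.roots(np.concatenate((b, [1.0])))

print(np.abs(zeros_fwd).max())   # largest zero magnitude of the forward PEF
print(np.abs(zeros_bwd).min())   # smallest zero magnitude of the backward PEF
```

The zeros of B(z) are the reciprocals of the zeros of A(z), which is why one set lies inside the unit circle and the other outside.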
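PROPERTY 5.4.3. The forward and backward prediction error filters can be expressed in terms of the eigenvalues $\bar{\lambda}_i$ and the 
eigenvectors $\bar{\mathbf{q}}_i$ of the correlation matrix $\bar{\mathbf{R}}(n)$ as follows 

$$ \begin{bmatrix} 1 \\ \mathbf{a}_o(n) \end{bmatrix} = P_o^f(n) \sum_{i=1}^{M+1} \frac{1}{\bar{\lambda}_i}\,\bar{\mathbf{q}}_i\,\bar{q}_{i,1}^* \tag{5.4.43} $$

$$ \begin{bmatrix} \mathbf{b}_o(n) \\ 1 \end{bmatrix} = P_o^b(n) \sum_{i=1}^{M+1} \frac{1}{\bar{\lambda}_i}\,\bar{\mathbf{q}}_i\,\bar{q}_{i,M+1}^* \tag{5.4.44} $$

where $\bar{q}_{i,1}$ and $\bar{q}_{i,M+1}$ are the first and last components of $\bar{\mathbf{q}}_i$. The first equation of (5.4.43) and the last equation in (5.4.44) can 
be solved to provide the MMSEs $P_o^f(n)$ and $P_o^b(n)$, respectively. 

Proof. See Problem 5.14. 

PROPERTY 5.4.4. The MMSE prediction errors can be expressed as 

$$ P_o^f(n) = \frac{\det \bar{\mathbf{R}}(n)}{\det \mathbf{R}(n-1)}, \qquad P_o^b(n) = \frac{\det \bar{\mathbf{R}}(n)}{\det \mathbf{R}(n)} \tag{5.4.45} $$

Proof. Problem 5.17. 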
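
Properties 5.4.3 and 5.4.4 can be checked against a direct solution of the normal equations. The following sketch assumes a stationary real process with a hypothetical autocorrelation sequence; the variable names are illustrative only.

```python
import numpy as np

M = 3
lags = np.arange(M + 1)
# Hypothetical real autocorrelation r(0), ..., r(M).
r = 0.5 ** lags + 0.3 * (-0.7) ** lags

idx = np.abs(np.subtract.outer(lags, lags))
R_bar = r[idx]            # (M+1) x (M+1) augmented correlation matrix
R = R_bar[:M, :M]         # M x M correlation matrix

# Direct solution of the FLP normal equations and MMSE.
a = np.linalg.solve(R, -r[1:])
P_f = r[0] + r[1:] @ a

# Property 5.4.4: ratio of determinants.
P_f_det = np.linalg.det(R_bar) / np.linalg.det(R)

# Property 5.4.3: eigen-expansion of the forward PEF.
lam, Q = np.linalg.eigh(R_bar)          # columns of Q are the eigenvectors
pef = P_f * (Q / lam) @ Q[0, :]         # sum_i (1/lam_i) q_i q_{i,1}

print(np.isclose(P_f, P_f_det))
print(np.allclose(pef, np.concatenate(([1.0], a))))
```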


The previous concepts are illustrated in the following example. 


EXAMPLE 5.4.2. A random sequence x(n) is generated by passing the white Gaussian noise process w(n) ~ WN(0,1) 
through the filter 

$$ x(n) = w(n) + \tfrac{1}{2}\,w(n-1) $$

Determine the second-order FLP, BLP, and symmetric linear signal smoother. 

Solution. The complex power spectrum is 

$$ R(z) = H(z)H(z^{-1}) = \left(1 + \tfrac{1}{2}z^{-1}\right)\left(1 + \tfrac{1}{2}z\right) = \tfrac{1}{2}z + \tfrac{5}{4} + \tfrac{1}{2}z^{-1} $$

Therefore, the autocorrelation sequence is equal to r(0) = 5/4, r(±1) = 1/2, r(l) = 0 for |l| ≥ 2. Since the power spectrum 
$R(e^{j\omega}) = 5/4 + \cos\omega > 0$ for all ω, the autocorrelation matrix is positive definite. The same is true of any principal submatrix. 

To determine the second-order linear signal estimators, we start with the matrix 

$$ \bar{\mathbf{R}} = \begin{bmatrix} \tfrac{5}{4} & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & \tfrac{5}{4} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{5}{4} \end{bmatrix} $$

and follow the procedure outlined in Section 5.4.1 or use the formulas in Table 5.3. The results are 

Forward linear prediction (i = 0): $\{a_k\} \to \{1, -0.476, 0.190\}$, $P_o^f = 1.0119$ 
Symmetric linear smoothing (i = 1): $\{c_k\} \to \{-0.4, 1, -0.4\}$, $P_o^s = 0.8500$ 
Backward linear prediction (i = 2): $\{b_k\} \to \{0.190, -0.476, 1\}$, $P_o^b = 1.0119$ 


The inverse of the correlation matrix $\bar{\mathbf{R}}$ is 

$$ \bar{\mathbf{R}}^{-1} = \begin{bmatrix} 0.9882 & -0.4706 & 0.1882 \\ -0.4706 & 1.1765 & -0.4706 \\ 0.1882 & -0.4706 & 0.9882 \end{bmatrix} $$

and we see that dividing the first, second, and third columns by 0.9882, 1.1765, and 0.9882 provides the forward PEF, the 
symmetric linear smoothing filter, and the backward PEF, respectively. The inverses of the diagonal elements provide the MMSEs 
$P_o^f$, $P_o^s$, and $P_o^b$. The reader can easily see, by computing the zeros of the corresponding system functions, that the FLP is 
minimum-phase, the BLP is maximum-phase, and the symmetric linear smoother is mixed-phase. It is interesting to note that the 
smoother performs better than either of the predictors. 
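
The numbers in this example can be reproduced with a few lines of NumPy. This is only a sketch of the column-scaling observation made above, using the correlation matrix of the example; the variable names are illustrative.

```python
import numpy as np

# Correlation matrix of Example 5.4.2: r(0) = 5/4, r(1) = 1/2, r(2) = 0.
R_bar = np.array([[1.25, 0.50, 0.00],
                  [0.50, 1.25, 0.50],
                  [0.00, 0.50, 1.25]])

R_inv = np.linalg.inv(R_bar)

# Dividing column i by its diagonal element gives the estimator for index i
# (i = 0: forward PEF, i = 1: symmetric smoother, i = 2: backward PEF).
filters = R_inv / np.diag(R_inv)

# The inverses of the diagonal elements are the corresponding MMSEs.
mmse = 1.0 / np.diag(R_inv)

print(np.round(filters.T, 3))   # one estimator per row
print(np.round(mmse, 4))

# Zeros of the smoother -0.4 + z^-1 - 0.4 z^-2 (one inside, one outside the unit circle).
print(np.roots(filters[:, 1]))
```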


5.5 Optimum Infinite Impulse Response Filters 


So far we have dealt with optimum FIR filters and predictors for nonstationary and stationary processes. In this 
section, we consider the design of optimum IIR filters for stationary stochastic processes. For nonstationary processes, 
the theory becomes very complicated. The Wiener-Hopf equations for optimum IIR filters are the same as for FIR filters; 
only the limits in the convolution summation and the range of values for which the normal equations hold are 
different. Both are determined by the limits of summation in the filter convolution equation. We can easily see from 
(5.3.16) and (5.3.17), or by applying the orthogonality principle (5.2.41), that the optimum IIR filter 


$$ \hat{y}(n) = \sum_{k} h_o(k)\,x(n-k) \tag{5.5.1} $$

is specified by the Wiener-Hopf equations 

$$ \sum_{k} h_o(k)\,r_x(m-k) = r_{yx}(m) \tag{5.5.2} $$

and the MMSE is given by 

$$ P_o = r_y(0) - \sum_{k} h_o(k)\,r_{yx}^*(k) \tag{5.5.3} $$

where $r_x(l)$ is the autocorrelation of the input stochastic process x(n) and $r_{yx}(l)$ is the cross-correlation between 
x(n) and desired response process y(n). We assume that the processes x(n) and y(n) are jointly wide-sense 
stationary with zero mean values. 

The range of summation in the above equations includes all the nonzero coefficients of the impulse response of 
the filter. The range of k in (5.5.1) determines the number of unknowns and the number of equations, that is, the 
range of m. For IIR filters, we have an infinite number of equations and unknowns, and thus only analytical 
solutions for (5.5.2) are possible. The key to analytical solutions is that the left-hand side of (5.5.2) can be expressed 
as the convolution of $h_o(m)$ with $r_x(m)$, that is, 

$$ h_o(m) * r_x(m) = r_{yx}(m) \tag{5.5.4} $$
which is a convolutional equation that can be solved by using the z-transform. The complexity of the solution 
depends on the range of m. 

The formula for the MMSE is the same for any filter, either FIR or IIR. Indeed, using Parseval's theorem and 
(5.5.3), we obtain 

$$ P_o = r_y(0) - \frac{1}{2\pi j}\oint_C H_o(z)\,R_{yx}^*\!\left(\frac{1}{z^*}\right) z^{-1}\,dz \tag{5.5.5} $$

where $H_o(z)$ is the system function of the optimum filter and $R_{yx}(z) = Z\{r_{yx}(l)\}$. The power $P_y$ can be computed 
by 

$$ P_y = r_y(0) = \frac{1}{2\pi j}\oint_C R_y(z)\,z^{-1}\,dz \tag{5.5.6} $$

where $R_y(z) = Z\{r_y(l)\}$. Combining (5.5.5) with (5.5.6), we obtain 

$$ P_o = \frac{1}{2\pi j}\oint_C \left[R_y(z) - H_o(z)\,R_{yx}^*\!\left(\frac{1}{z^*}\right)\right] z^{-1}\,dz \tag{5.5.7} $$

which expresses the MMSE in terms of z-transforms. To obtain the MMSE in the frequency domain, we replace z 
by $e^{j\omega}$. For example, (5.5.5) becomes 

$$ P_o = r_y(0) - \frac{1}{2\pi}\int_{-\pi}^{\pi} H_o(e^{j\omega})\,R_{yx}^*(e^{j\omega})\,d\omega $$

where $H_o(e^{j\omega})$ is the frequency response of the optimum filter. 


5.5.1 Noncausal IIR Filters 


For the noncausal IIR filter 

$$ \hat{y}(n) = \sum_{k=-\infty}^{\infty} h_{nc}(k)\,x(n-k) \tag{5.5.8} $$

the range of the Wiener-Hopf equations (5.5.2) is $-\infty < m < \infty$, and the equations can be easily solved by using the convolution 
property of the z-transform. This gives 

$$ H_{nc}(z)\,R_x(z) = R_{yx}(z) $$

or 

$$ H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} \tag{5.5.9} $$

where $H_{nc}(z)$ is the system function of the optimum filter, $R_x(z)$ is the complex PSD of x(n), and $R_{yx}(z)$ is 
the complex cross-PSD between y(n) and x(n). 


5.5.2 Causal IIR Filters 
For the causal IIR filter 

$$ \hat{y}(n) = \sum_{k=0}^{\infty} h_c(k)\,x(n-k) \tag{5.5.10} $$

the Wiener-Hopf equations (5.5.2) hold only for m in the range $0 \le m < \infty$. Since the sequence $r_{yx}(m)$ can be 
expressed as the convolution of $h_c(m)$ and $r_x(m)$ only for $m \ge 0$, we cannot solve (5.5.2) using the z-transform. 
However, a simple solution is possible using the spectral factorization theorem.¹ This approach was introduced for 
continuous-time processes in Bode and Shannon (1950) and Zadeh and Ragazzini (1950). It is based on the following 
two observations: 

1. The solution of the Wiener-Hopf equations is trivial if the input is white. 

2. Any regular process can be transformed to an equivalent white process. 


White input processes. We first note that if the process x(n) is white noise, the solution of the Wiener-Hopf 
equations is trivial. Indeed, if 

$$ r_x(l) = \sigma_x^2\,\delta(l) $$

then Equation (5.5.4) gives 

$$ h_c(m) * \delta(m) = \frac{r_{yx}(m)}{\sigma_x^2} \qquad 0 \le m < \infty $$

which implies that 

$$ h_c(m) = \begin{cases} \dfrac{1}{\sigma_x^2}\,r_{yx}(m) & 0 \le m < \infty \\ 0 & m < 0 \end{cases} \tag{5.5.11} $$

because the filter is causal. The system function of the optimum filter is given by 

$$ H_c(z) = \frac{1}{\sigma_x^2}\,[R_{yx}(z)]_+ \tag{5.5.12} $$

where 

$$ [R_{yx}(z)]_+ \triangleq \sum_{l=0}^{\infty} r_{yx}(l)\,z^{-l} \tag{5.5.13} $$

is the one-sided z-transform of the two-sided sequence $r_{yx}(l)$. The MMSE is given by 

$$ P_c = r_y(0) - \frac{1}{\sigma_x^2}\sum_{k=0}^{\infty} |r_{yx}(k)|^2 \tag{5.5.14} $$

which follows from (5.5.3) and (5.5.11). 

¹ An analogous matrix-based approach is extensively used in Chapter 6 for the design and implementation of optimum FIR filters. 
Regular input processes. The PSD of a regular process can be factored as 

$$ R_x(z) = \sigma_x^2\,H_x(z)\,H_x^*\!\left(\frac{1}{z^*}\right) \tag{5.5.15} $$

where $H_x(z)$ is the innovations filter (see Section 3.1). The innovations process 

$$ w(n) = x(n) - \sum_{k=1}^{\infty} h_x(k)\,w(n-k) \tag{5.5.16} $$

is white and linearly equivalent to the input process x(n). Therefore, linear estimation of y(n) based on x(n) is 
equivalent to linear estimation of y(n) based on w(n). The optimum filter that estimates y(n) from x(n) is 
obtained by cascading the whitening filter $1/H_x(z)$ with the optimum filter that estimates y(n) from w(n) (see 
Figure 5.18). Since w(n) is white, the optimum filter for estimating y(n) from w(n) is 

$$ H_c'(z) = \frac{1}{\sigma_x^2}\,[R_{yw}(z)]_+ \tag{5.5.17} $$

where $[R_{yw}(z)]_+$ is the one-sided z-transform of $r_{yw}(l)$. To express $H_c'(z)$ in terms of $R_{yx}(z)$, we need the 
relationship between $R_{yw}(z)$ and $R_{yx}(z)$. From 

$$ x(n) = \sum_{k=0}^{\infty} h_x(k)\,w(n-k) $$

we obtain 

$$ E\{y(n)\,x^*(n-l)\} = \sum_{k=0}^{\infty} h_x^*(k)\,E\{y(n)\,w^*(n-l-k)\} $$

or 

$$ r_{yx}(l) = \sum_{k=0}^{\infty} h_x^*(k)\,r_{yw}(l+k) \tag{5.5.18} $$

Taking the z-transform of the above equation leads to 

$$ R_{yw}(z) = \frac{R_{yx}(z)}{H_x^*(1/z^*)} \tag{5.5.19} $$

which, combined with (5.5.17), gives 

$$ H_c'(z) = \frac{1}{\sigma_x^2}\left[\frac{R_{yx}(z)}{H_x^*(1/z^*)}\right]_+ \tag{5.5.20} $$

which is the causal optimum filter for the estimation of y(n) from w(n). The optimum filter for estimating y(n) 
from x(n) is 

$$ H_c(z) = \frac{1}{\sigma_x^2\,H_x(z)}\left[\frac{R_{yx}(z)}{H_x^*(1/z^*)}\right]_+ \tag{5.5.21} $$

which is causal since it is the cascade connection of two causal filters [see Figure 5.19(a)]. 


[Figure: the optimum filter shown as the cascade of the whitening filter and the optimum filter for white input.]

FIGURE 5.18 
Optimum causal IIR filter design by the spectral factorization method. 

[Figure: (a) optimum causal IIR filter, formed by the whitening filter followed by the optimum causal filter for white input; (b) optimum noncausal IIR filter, formed by the whitening filter followed by the optimum noncausal filter for white input.]

FIGURE 5.19 
Comparison of causal and noncausal IIR optimum filters. 


The MMSE from (5.5.3) can also be expressed as 

$$ P_c = r_y(0) - \frac{1}{\sigma_x^2}\sum_{k=0}^{\infty} |r_{yw}(k)|^2 \tag{5.5.22} $$

which shows that the MMSE decreases as we increase the order of the filter. Table 5.5 summarizes the equations 
required for the design of optimum FIR and IIR filters. 
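TABLE 5.5 
Design of FIR and IIR optimum filters for stationary processes. 

| Filter type | Solution | Required quantities |
|---|---|---|
| FIR | $e(n) = y(n) - \mathbf{c}_o^H\mathbf{x}(n)$; $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$; $P_o = r_y(0) - \mathbf{d}^H\mathbf{c}_o$ | $\mathbf{R} = [r_x(m-k)]$, $\mathbf{d} = [r_{yx}(m)]$, $0 \le k, m \le M-1$, M finite |
| Noncausal IIR | $H_{nc}(z) = \dfrac{R_{yx}(z)}{R_x(z)}$; $P_{nc} = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k)\,r_{yx}^*(k)$ | $R_x(z) = Z\{r_x(l)\}$, $R_{yx}(z) = Z\{r_{yx}(l)\}$ |
| Causal IIR | $H_c(z) = \dfrac{1}{\sigma_x^2 H_x(z)}\left[\dfrac{R_{yx}(z)}{H_x^*(1/z^*)}\right]_+$; $P_c = r_y(0) - \sum_{k=0}^{\infty} h_c(k)\,r_{yx}^*(k)$ | $R_x(z) = \sigma_x^2 H_x(z) H_x^*(1/z^*)$, $R_{yx}(z) = Z\{r_{yx}(l)\}$ |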


Finally, since the equation for the noncausal IIR filter can be written as 

$$ H_{nc}(z) = \frac{1}{\sigma_x^2\,H_x(z)}\,\frac{R_{yx}(z)}{H_x^*(1/z^*)} \tag{5.5.23} $$

we see that the only difference from the causal filter is that the noncausal filter includes both the causal and noncausal 
parts of $R_{yx}(z)/H_x^*(1/z^*)$ [see Figure 5.19(b)]. By using the innovations process w(n), the MMSE can be 
expressed as 

$$ P_{nc} = r_y(0) - \frac{1}{\sigma_x^2}\sum_{k=-\infty}^{\infty} |r_{yw}(k)|^2 \tag{5.5.24} $$

and is known as the irreducible MMSE because it is the best performance that can be achieved by a linear filter. 
Indeed, since $|r_{yw}(k)|^2 \ge 0$, every coefficient we add to the optimum filter can help to reduce the MMSE. 


5.5.3 Filtering of Additive Noise 


To illustrate the optimum filtering theory developed above, we consider the problem of estimating a “useful” or 
desired signal y(n) that is corrupted by additive noise v(n) . The goal is to find an optimum filter that extracts the 
signal y(n) from the noisy observations 

x(n) = y(n) + v(n) (5.5.25) 
given that y(n) and v(n) are uncorrelated processes with known autocorrelation sequences $r_y(l)$ and $r_v(l)$. 

To design the optimum filter, we need the autocorrelation $r_x(l)$ of the input signal x(n) and the 
cross-correlation $r_{yx}(l)$ between the desired response y(n) and the input signal x(n). Using (5.5.25), we find 

$$ r_x(l) = E\{x(n)\,x^*(n-l)\} = r_y(l) + r_v(l) \tag{5.5.26} $$

and 

$$ r_{yx}(l) = E\{y(n)\,x^*(n-l)\} = r_y(l) \tag{5.5.27} $$

because y(n) and v(n) are uncorrelated. 

The design of optimum IIR filters requires the functions $R_x(z)$ and $R_{yx}(z)$. Taking the z-transform of 
(5.5.26) and (5.5.27), we obtain 

$$ R_x(z) = R_y(z) + R_v(z) \tag{5.5.28} $$

and 

$$ R_{yx}(z) = R_y(z) \tag{5.5.29} $$

The noncausal optimum filter is 

$$ H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} = \frac{R_y(z)}{R_y(z) + R_v(z)} \tag{5.5.30} $$

which for $z = e^{j\omega}$ shows that, for those values of ω for which $|R_y(e^{j\omega})| \gg |R_v(e^{j\omega})|$, that is, for high SNR, we 
have $|H_{nc}(e^{j\omega})| \approx 1$. In contrast, if $|R_y(e^{j\omega})| \ll |R_v(e^{j\omega})|$, that is, for low SNR, we have $|H_{nc}(e^{j\omega})| \approx 0$. 
Thus, the optimum filter “passes” its input in bands with high SNR and attenuates it in bands with low SNR, as we 
would expect intuitively. 

Substituting (5.5.30) into (5.5.7), we obtain 

$$ P_{nc} = \frac{1}{2\pi j}\oint_C \frac{R_y(z)\,R_v(z)}{R_y(z) + R_v(z)}\,z^{-1}\,dz \tag{5.5.31} $$

which provides an expression for the MMSE that does not require knowledge of the optimum filter. 
We next illustrate the design of optimum filters for the reduction of additive noise with a detailed numerical 
example. 
EXAMPLE 5.5.1. In this example we illustrate the design of an optimum IIR filter to extract a random signal 
with known autocorrelation sequence 

$$ r_y(l) = \alpha^{|l|} \qquad -1 < \alpha < 1 \tag{5.5.32} $$

which is corrupted by additive white noise with autocorrelation 

$$ r_v(l) = \sigma_v^2\,\delta(l) \tag{5.5.33} $$

The processes y(n) and v(n) are uncorrelated. 
Required statistical moments. The input to the filter is the signal x(n) = y(n) + v(n) and the desired response is the signal y(n). 
The first step in the design is to determine the required second-order moments, that is, the autocorrelation of the input process and the 
cross-correlation between input and desired response. Substituting into (5.5.26) and (5.5.27), we have 

$$ r_x(l) = \alpha^{|l|} + \sigma_v^2\,\delta(l) \tag{5.5.34} $$

and 

$$ r_{yx}(l) = \alpha^{|l|} \tag{5.5.35} $$

To simplify the derivations and deal with “nice, round” numbers, we choose α = 0.8 and $\sigma_v^2 = 1$. Then the complex power 
spectral densities of y(n), v(n), and x(n) are 

$$ R_y(z) = \frac{\left(\tfrac{3}{5}\right)^2}{\left(1 - \tfrac{4}{5}z^{-1}\right)\left(1 - \tfrac{4}{5}z\right)} \qquad \tfrac{4}{5} < |z| < \tfrac{5}{4} \tag{5.5.36} $$

$$ R_v(z) = \sigma_v^2 = 1 \tag{5.5.37} $$

and 

$$ R_x(z) = \frac{8}{5}\,\frac{\left(1 - \tfrac{1}{2}z^{-1}\right)\left(1 - \tfrac{1}{2}z\right)}{\left(1 - \tfrac{4}{5}z^{-1}\right)\left(1 - \tfrac{4}{5}z\right)} \tag{5.5.38} $$

respectively. 
Noncausal filter. Using (5.5.9), (5.5.29), (5.5.36), and (5.5.38), we obtain 

$$ H_{nc}(z) = \frac{R_{yx}(z)}{R_x(z)} = \frac{9}{40}\,\frac{1}{\left(1 - \tfrac{1}{2}z^{-1}\right)\left(1 - \tfrac{1}{2}z\right)} \qquad \tfrac{1}{2} < |z| < 2 $$

Evaluating the inverse z-transform, we have 

$$ h_{nc}(n) = \frac{3}{10}\left(\frac{1}{2}\right)^{|n|} \qquad -\infty < n < \infty $$

which clearly corresponds to a noncausal filter. From (5.5.3), the MMSE is 

$$ P_{nc} = 1 - \frac{3}{10}\sum_{k=-\infty}^{\infty}\left(\frac{1}{2}\right)^{|k|}\left(\frac{4}{5}\right)^{|k|} = \frac{3}{10} \tag{5.5.39} $$

and provides the irreducible MMSE. 
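
The noncausal solution can be cross-checked numerically by sampling (5.5.30) on the unit circle and inverting it with an FFT. The sketch below assumes the example values α = 0.8 and unit noise variance; the grid size N is an arbitrary choice.

```python
import numpy as np

alpha, var_v = 0.8, 1.0
N = 4096                                   # number of frequency samples (assumed)
w = 2 * np.pi * np.arange(N) / N

# PSD of the signal, (5.5.36) evaluated at z = exp(jw).
R_y = (1 - alpha**2) / np.abs(1 - alpha * np.exp(-1j * w)) ** 2

# Noncausal Wiener filter (5.5.30) and its impulse response.
H_nc = R_y / (R_y + var_v)
h_nc = np.fft.ifft(H_nc).real              # h_nc[n] for n >= 0; negative n wrap around

# MMSE from (5.5.31): average of R_y R_v / (R_y + R_v) over frequency.
P_nc = np.mean(R_y * var_v / (R_y + var_v))

print(np.round(h_nc[:4], 4))               # compare with (3/10)(1/2)^n
print(round(h_nc[-1], 4))                  # h_nc(-1), equal to h_nc(1)
print(round(P_nc, 6))                      # compare with 3/10
```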

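Causal filter. To find the optimum causal filter, we need to perform the spectral factorization 

$$ R_x(z) = \sigma_x^2\,H_x(z)\,H_x(z^{-1}) $$

which is provided by (5.5.38) with 

$$ \sigma_x^2 = \frac{8}{5} \tag{5.5.40} $$

and 

$$ H_x(z) = \frac{1 - \tfrac{1}{2}z^{-1}}{1 - \tfrac{4}{5}z^{-1}} \tag{5.5.41} $$

Thus, 

$$ R_{yw}(z) = \frac{R_{yx}(z)}{H_x(z^{-1})} = \frac{0.36}{\left(1 - \tfrac{4}{5}z^{-1}\right)\left(1 - \tfrac{1}{2}z\right)} = \frac{0.6}{1 - \tfrac{4}{5}z^{-1}} + \frac{0.3\,z}{1 - \tfrac{1}{2}z} \tag{5.5.42} $$

where the first term (causal) converges for |z| > 4/5 and the second term (noncausal) converges for |z| < 2. Hence, taking the 
causal part 

$$ \left[\frac{R_{yx}(z)}{H_x(z^{-1})}\right]_+ = \frac{\tfrac{3}{5}}{1 - \tfrac{4}{5}z^{-1}} $$

and substituting into (5.5.21), we obtain the causal optimum filter 

$$ H_c(z) = \frac{5}{8}\,\frac{1 - \tfrac{4}{5}z^{-1}}{1 - \tfrac{1}{2}z^{-1}}\,\frac{\tfrac{3}{5}}{1 - \tfrac{4}{5}z^{-1}} = \frac{3}{8}\,\frac{1}{1 - \tfrac{1}{2}z^{-1}} \qquad |z| > \tfrac{1}{2} \tag{5.5.43} $$

The impulse response is 

$$ h_c(n) = \frac{3}{8}\left(\frac{1}{2}\right)^{n} u(n) $$

which corresponds to a causal and stable IIR filter. The MMSE is 

$$ P_c = r_y(0) - \sum_{k=0}^{\infty} h_c(k)\,r_{yx}(k) = 1 - \frac{3}{8}\sum_{k=0}^{\infty}\left(\frac{1}{2}\right)^{k}\left(\frac{4}{5}\right)^{k} = \frac{3}{8} \tag{5.5.44} $$

which is, as expected, larger than $P_{nc}$. 
From (5.5.43), we see that the optimum causal filter is a first-order recursive filter that can be implemented by the difference 
equation 

$$ \hat{y}(n) = \frac{1}{2}\,\hat{y}(n-1) + \frac{3}{8}\,x(n) $$

In general, this is possible only when $H_c(z)$ is a rational function. 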
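
The difference equation above can be run directly. The sketch below drives it with a unit impulse to recover $h_c(n)$ and then evaluates the MMSE sum in (5.5.44) on a truncated horizon; the function name `causal_wiener` and the truncation length K are assumptions made for illustration.

```python
import numpy as np

def causal_wiener(x):
    """First-order recursion y_hat(n) = 0.5 * y_hat(n-1) + 0.375 * x(n)."""
    y_hat = np.zeros(len(x))
    prev = 0.0
    for n, xn in enumerate(x):
        prev = 0.5 * prev + 0.375 * xn
        y_hat[n] = prev
    return y_hat

K = 200                                # truncation length for the infinite sums
impulse = np.zeros(K)
impulse[0] = 1.0
h_c = causal_wiener(impulse)           # impulse response (3/8)(1/2)^n

r_yx = 0.8 ** np.arange(K)             # r_yx(k) = alpha^k for k >= 0, alpha = 0.8
P_c = 1.0 - h_c @ r_yx                 # MMSE from (5.5.44)

print(h_c[:3])                         # first impulse response samples
print(P_c)                             # compare with 3/8
```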
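Computation of MMSE using the innovation. We next illustrate how to find the MMSE by using the cross-correlation sequence 
$r_{yw}(l)$. From (5.5.42), we obtain 

$$ r_{yw}(l) = \begin{cases} \dfrac{3}{5}\left(\dfrac{4}{5}\right)^{l} & l \ge 0 \\[2mm] \dfrac{3}{5}\,2^{\,l} & l < 0 \end{cases} \tag{5.5.45} $$

which, in conjunction with (5.5.22) and (5.5.24), gives 

$$ P_c = r_y(0) - \frac{1}{\sigma_x^2}\sum_{k=0}^{\infty} r_{yw}^2(k) = 1 - \frac{5}{8}\sum_{k=0}^{\infty}\left(\frac{3}{5}\right)^{2}\left(\frac{4}{5}\right)^{2k} = \frac{3}{8} $$

and 

$$ P_{nc} = r_y(0) - \frac{1}{\sigma_x^2}\sum_{k=0}^{\infty} r_{yw}^2(k) - \frac{1}{\sigma_x^2}\sum_{k=-\infty}^{-1} r_{yw}^2(k) = \frac{3}{10} $$

which agree with (5.5.44) and (5.5.39). 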


Noncausal smoothing filter. Suppose now that we want to estimate the value y(n+D) of the desired response from the data 
x(n), $-\infty < n < \infty$. Since 

$$ E\{y(n+D)\,x^*(n-l)\} = r_{yx}(l+D) \tag{5.5.46} $$

and 

$$ Z\{r_{yx}(l+D)\} = z^{D} R_{yx}(z) \tag{5.5.47} $$

the noncausal Wiener smoothing filter is 

$$ H_{nc}^{[D]}(z) = \frac{z^{D} R_{yx}(z)}{R_x(z)} = \frac{z^{D} R_y(z)}{R_x(z)} = z^{D} H_{nc}(z) \tag{5.5.48} $$

and 

$$ h_{nc}^{[D]}(n) = h_{nc}(n+D) \tag{5.5.49} $$

The MMSE is 

$$ P_{nc}^{[D]} = r_y(0) - \sum_{k=-\infty}^{\infty} h_{nc}(k+D)\,r_{yx}(k+D) = P_{nc} \tag{5.5.50} $$

which is independent of the time shift D. 
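Causal prediction filter. We estimate the value y(n+D) (D > 0) of the desired response using the data x(k), $-\infty < k \le n$. 
The whitening part of the causal prediction filter does not depend on y(n) and is still given by (5.5.41). The coloring part 
depends on y(n+D) and is given by $R_{yw}'(z) = z^{D} R_{yw}(z)$ or $r_{yw}'(l) = r_{yw}(l+D)$. Taking into consideration that D > 0, we 
can show (see Problem 5.31) that the system function and the impulse response of the causal Wiener predictor are 

$$ H_c^{[D]}(z) = \frac{5}{8}\,\frac{1 - \tfrac{4}{5}z^{-1}}{1 - \tfrac{1}{2}z^{-1}}\,\frac{\tfrac{3}{5}\left(\tfrac{4}{5}\right)^{D}}{1 - \tfrac{4}{5}z^{-1}} = \frac{3}{8}\,\frac{\left(\tfrac{4}{5}\right)^{D}}{1 - \tfrac{1}{2}z^{-1}} \tag{5.5.51} $$

and 

$$ h_c^{[D]}(n) = \frac{3}{8}\left(\frac{4}{5}\right)^{D}\left(\frac{1}{2}\right)^{n} u(n) \tag{5.5.52} $$

respectively. This shows that as $D \to \infty$, the impulse response $h_c^{[D]}(n) \to 0$, which is consistent with our intuition that the 
prediction is less and less reliable. The MMSE is 

$$ P_c^{[D]} = 1 - \frac{3}{8}\left(\frac{4}{5}\right)^{D}\sum_{k=0}^{\infty}\left(\frac{1}{2}\right)^{k}\left(\frac{4}{5}\right)^{k+D} = 1 - \frac{5}{8}\left(\frac{4}{5}\right)^{2D} \tag{5.5.53} $$

and $P_c^{[D]} \to r_y(0) = 1$ as $D \to \infty$, which agrees with our earlier observation. For D = 2, the MMSE is 
$P_c^{[2]} = 93/125 = 0.7440 > P_c$, as expected. 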

Causal smoothing filter. To estimate the value y(n+D) (D < 0) of the desired response using the data x(k), $-\infty < k \le n$, 
we need a smoothing Wiener filter. The derivation, which is straightforward but somewhat involved, is left for Problem 5.32. The 
system function of the optimum smoothing filter is 

$$ H_c^{[D]}(z) = \frac{3}{8}\left[\frac{z^{D}}{1 - \tfrac{1}{2}z^{-1}} + \frac{1 - \tfrac{4}{5}z^{-1}}{1 - \tfrac{1}{2}z^{-1}}\sum_{l=0}^{-D-1} 2^{\,l+D} z^{-l}\right] \tag{5.5.54} $$

where D < 0. To find the impulse response for D = -2, we invert (5.5.54). This gives 

$$ h_c^{[-2]}(k) = \frac{3}{32}\,\delta(k) + \frac{51}{320}\,\delta(k-1) + \frac{39}{128}\left(\frac{1}{2}\right)^{k-2} u(k-2) \tag{5.5.55} $$

and if we express $r_{yx}(k-2)$ in a similar form, we can compute the MMSE 

$$ P_c^{[-2]} = 1 - \frac{3}{50} - \frac{51}{400} - \frac{39}{128}\cdot\frac{5}{3} = \frac{39}{128} \tag{5.5.56} $$

which is less than $P_c = 0.375$. This should be expected since the smoothing Wiener filter uses more information than the Wiener 
filter (i.e., when D = 0). In fact it can be shown that 

$$ \lim_{D \to -\infty} P_c^{[D]} = P_{nc} \quad \text{and} \quad \lim_{D \to -\infty} h_c^{[D]}(n) = h_{nc}^{[D]}(n) \tag{5.5.57} $$


which is illustrated in Figure 5.20. Figure 5.21 shows the impulse responses of the various optimum IIR filters designed in this 
example. Interestingly, all are obtained by shifting and truncating the impulse response of the optimum noncausal IIR filter. 
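
The dependence of the MMSE on the shift D can be tabulated from the innovations representation. The sketch below assumes that (5.5.22) is applied with the shifted sequence $r_{yw}(l+D)$, using (5.5.45) and $\sigma_x^2 = 8/5$ from this example; the function names and the truncation length K are illustrative choices.

```python
import numpy as np

def r_yw(l):
    """Cross-correlation (5.5.45): 0.6 * 0.8^l for l >= 0 and 0.6 * 2^l for l < 0."""
    l = np.asarray(l, dtype=float)
    return 0.6 * np.where(l >= 0, 0.8 ** np.abs(l), 2.0 ** (-np.abs(l)))

def causal_mmse(D, K=400):
    """MMSE of the causal estimator of y(n + D): 1 - (5/8) * sum_{k>=0} r_yw(k + D)^2."""
    k = np.arange(K)
    return 1.0 - (5.0 / 8.0) * np.sum(r_yw(k + D) ** 2)

for D in (2, 0, -2, -30):
    print(D, round(causal_mmse(D), 6))
# D = 2 is prediction, D = 0 is filtering, D < 0 is smoothing;
# large negative D approaches the noncausal (irreducible) MMSE.
```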


FIGURE 5.20 
MMSE as a function of the time shift D. 

FIGURE 5.21 
Impulse response of optimum filters for pure filtering, prediction, and smoothing. 
FIR filter. The Mth-order FIR filter is obtained by solving the linear system 

$$ \mathbf{R}\mathbf{h} = \mathbf{d} $$

where 

$$ \mathbf{R} = \text{Toeplitz}(1 + \sigma_v^2,\; \alpha,\; \ldots,\; \alpha^{M-1}) $$

and 

$$ \mathbf{d} = [1\;\; \alpha\;\; \cdots\;\; \alpha^{M-1}]^T $$

The MMSE is 

$$ P_o = r_y(0) - \sum_{k=0}^{M-1} h_o(k)\,r_{yx}(k) $$

and is shown in Figure 5.22 as a function of the order M together with $P_c$ and $P_{nc}$. We notice that an optimum FIR filter of order 
M = 4 provides satisfactory performance. This can be explained by noting that the impulse response of the causal optimum IIR filter 
is negligible for n > 4. 
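
The FIR design can be reproduced by solving the Toeplitz system for increasing M. This is a minimal sketch using the example values α = 0.8 and unit noise variance; the function name `fir_mmse` is an assumption for illustration.

```python
import numpy as np

alpha, var_v = 0.8, 1.0

def fir_mmse(M):
    """MMSE of the optimum M-tap FIR filter for the signal-in-white-noise example."""
    k = np.arange(M)
    # R = Toeplitz(1 + var_v, alpha, ..., alpha^(M-1)) and d = [1, alpha, ..., alpha^(M-1)]^T
    R = alpha ** np.abs(np.subtract.outer(k, k)) + var_v * np.eye(M)
    d = alpha ** k
    h = np.linalg.solve(R, d)          # optimum FIR coefficients
    return 1.0 - h @ d                 # P_o = r_y(0) - sum_k h_o(k) r_yx(k)

for M in range(1, 6):
    print(M, round(fir_mmse(M), 4))
# The values decrease toward the causal IIR bound P_c = 3/8 as M grows.
```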


[Figure: MMSE plotted against the FIR filter order M = 1, ..., 5.]

FIGURE 5.22 
MMSE as a function of the optimum FIR filter order M. 


5.5.4 Linear Prediction Using the Infinite Past—Whitening 


The one-step forward IIR linear predictor is a causal IIR optimum filter with desired response y(n) = x(n+1). The 
prediction error is 

$$ e^f(n+1) = x(n+1) - \sum_{k=0}^{\infty} h_{lp}(k)\,x(n-k) \tag{5.5.58} $$

where 

$$ H_{lp}(z) = \sum_{k=0}^{\infty} h_{lp}(k)\,z^{-k} \tag{5.5.59} $$

is the system function of the optimum predictor. Since y(n) = x(n+1), we have $r_{yx}(l) = r_x(l+1)$ and $R_{yx}(z) = zR_x(z)$. 
Hence, the optimum predictor is 

$$ H_{lp}(z) = \frac{1}{\sigma_x^2 H_x(z)}\left[\frac{z\,\sigma_x^2 H_x(z) H_x^*(1/z^*)}{H_x^*(1/z^*)}\right]_+ = \frac{[zH_x(z)]_+}{H_x(z)} = \frac{zH_x(z) - z}{H_x(z)} $$

and the prediction error filter (PEF) is 

$$ H_{\text{PEF}}(z) = \frac{E^f(z)}{X(z)} = 1 - z^{-1}H_{lp}(z) = \frac{1}{H_x(z)} \tag{5.5.60} $$

that is, the prediction error filter of the one-step IIR linear predictor of a regular process is identical to the whitening filter of the process. 

Therefore, the prediction error process is white, and the prediction error filter is minimum-phase. We will see that the 

efficient solution of optimum filtering problems includes as a prerequisite the solution of a linear prediction problem. 

Furthermore, algorithms for linear prediction provide a convenient way to perform spectral factorization in practice. 
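The MMSE is 

$$ \begin{aligned} P_o^f &= \frac{1}{2\pi j}\oint_C \left\{R_x(z) - \left[1 - \frac{1}{H_x(z)}\right] R_x(z)\right\} z^{-1}\,dz \\ &= \frac{1}{2\pi j}\oint_C R_x(z)\,\frac{1}{H_x(z)}\,z^{-1}\,dz \\ &= \sigma_x^2\,\frac{1}{2\pi j}\oint_C H_x^*\!\left(\frac{1}{z^*}\right) z^{-1}\,dz = \sigma_x^2 \end{aligned} \tag{5.5.61} $$

because 

$$ \frac{1}{2\pi j}\oint_C H_x^*\!\left(\frac{1}{z^*}\right) z^{-1}\,dz = h_x^*(0) = 1 $$

From (5.5.61) we have 

$$ P_o^f = \sigma_x^2 = \exp\left[\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln R_x(e^{j\omega})\,d\omega\right] \tag{5.5.62} $$

which is known as the Kolmogorov-Szegő formula. 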
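
The formula (5.5.62) lends itself to a quick numerical check. The sketch below evaluates it for the input PSD of Example 5.5.1, for which the spectral factorization gave $\sigma_x^2 = 8/5$; the frequency grid size N is an arbitrary assumption.

```python
import numpy as np

N = 8192                                   # number of frequency samples (assumed)
w = 2 * np.pi * np.arange(N) / N

# Input PSD of Example 5.5.1: R_x = R_y + R_v with alpha = 0.8 and unit noise variance.
R_x = 1.0 + 0.36 / np.abs(1 - 0.8 * np.exp(-1j * w)) ** 2

# Kolmogorov-Szego formula: exp of the average of ln R_x over one period.
sigma2 = np.exp(np.mean(np.log(R_x)))

print(sigma2)                              # compare with 8/5 from (5.5.40)
```

Averaging over a uniform grid is used here in place of the integral; for a smooth periodic integrand this approximation converges very quickly.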
We can easily see that the D-step predictor (D > 0) is given by 

$$ H_{lp}^{[D]}(z) = \frac{[z^{D} H_x(z)]_+}{H_x(z)} = \frac{1}{H_x(z)}\sum_{k=D}^{\infty} h_x(k)\,z^{-k+D} \tag{5.5.63} $$

but is not guaranteed to be minimum-phase for D ≠ 1. 
EXAMPLE 5.5.2. Consider a minimum-phase AR(2) process 

$$ x(n) = a_1 x(n-1) + a_2 x(n-2) + w(n) $$

where $w(n) \sim \text{WN}(0, \sigma_w^2)$. The complex PSD of the process is 

$$ R_x(z) = \sigma_x^2\,H_x(z)\,H_x(z^{-1}) \triangleq \frac{\sigma_w^2}{A(z)A(z^{-1})} $$

where $A(z) \triangleq 1 - a_1 z^{-1} - a_2 z^{-2}$ and $\sigma_x^2 = \sigma_w^2$. The one-step forward predictor is given by 

$$ H_{lp}(z) = z - \frac{z}{H_x(z)} = z - zA(z) = a_1 + a_2 z^{-1} $$

or 

$$ \hat{x}(n+1) = a_1 x(n) + a_2 x(n-1) $$

as should be expected because the present value of the process depends only on the past two values. Since the excitation w(n) is 
white and cannot be predicted from the present or previous values of the signal x(n), it is equal to the prediction error $e^f(n)$. 
Therefore, $\sigma_{e^f}^2 = \sigma_w^2$, as expected from (5.5.62). This shows that the MMSE of the one-step linear predictor depends on the 
spectral flatness measure (SFM) of the process x(n). It is maximum for a white noise process, which is clearly unpredictable. 
Predictable processes. A random process x(n) is said to be (exactly) predictable if $P_o^f = E\{|e^f(n)|^2\} = 0$. We 
next show that a process x(n) is predictable if and only if its PSD consists of impulses, that is, 

$$ R_x(e^{j\omega}) = \sum_{k} A_k\,\delta(\omega - \omega_k) \tag{5.5.64} $$

or in other words, x(n) is a harmonic process. For this reason harmonic processes are also known as deterministic 
processes. From (5.5.60) we have 

$$ P_o^f = E\{|e^f(n)|^2\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |H_{\text{PEF}}(e^{j\omega})|^2\,R_x(e^{j\omega})\,d\omega \tag{5.5.65} $$

where $H_{\text{PEF}}(e^{j\omega})$ is the frequency response of the prediction error filter. Since $R_x(e^{j\omega}) \ge 0$, the integral in (5.5.65) 
is zero if and only if $|H_{\text{PEF}}(e^{j\omega})|^2 R_x(e^{j\omega}) = 0$. This is possible only if $R_x(e^{j\omega})$ is a linear combination of 
impulses, as in (5.5.64), and $e^{j\omega_k}$ are the zeros of $H_{\text{PEF}}(z)$ on the unit circle (Papoulis 1985). 

From the Wold decomposition theorem (see Section 3.1.3) we know that every random process can be 
decomposed into two components that are mutually orthogonal: (1) a regular component with continuous PSD that 
can be modeled as the response of a minimum-phase system to white noise and (2) a predictable process that can be 
exactly predicted from a linear combination of past values. This component has a line PSD and is essentially a 
harmonic process. A complete discussion of this subject can be found in Papoulis (1985, 1991) and Therrien (1992). 


5.6 Inverse Filtering and Deconvolution 


In many practical applications, a signal of interest passes through a distorting system whose output may be corrupted 
by additive noise. When the distorting system is linear and time-invariant, the observed signal is the convolution of 
the desired input with the impulse response of the system. Since in most cases we deal with linear and time-invariant 
systems, the terms filtering and convolution are often used interchangeably. 

Deconvolution is the process of retrieving the unknown input of a known system by using its observed output. If 
the system is also unknown, which is more common in practical applications, we have a problem of blind 
deconvolution. The term blind deconvolution was introduced in Stockham et al. (1975) for a method used to restore 
old records. Other applications include estimation of the vocal tract in speech processing, equalization of 
communication channels, deconvolution of seismic data for the elimination of multiple reflections, and image 
restoration. 

The basic problem is illustrated in Figure 5.23. The output of the unknown LTI system G(z), which is assumed 
BIBO stable, is given by 

$$ x(n) = \sum_{k=-\infty}^{\infty} g(k)\,w(n-k) \tag{5.6.1} $$

where $w(n) \sim \text{IID}(0, \sigma_w^2)$ is a white noise sequence. Suppose that we observe the output x(n) and that we wish to 
recover the input signal w(n), and possibly the system G(z), using the output signal and some statistical 
information about the input. 

[Figure: the unknown input w(n) drives the unknown system G(z); its output x(n) drives the deconvolution filter H(z), which produces y(n).]

FIGURE 5.23 
Basic blind deconvolution model. 

If we know the system G(z), the inverse system H(z) is obtained by noticing that perfect retrieval of the 
input is possible if 

$$ h(n) * g(n) * w(n) = b_0\,w(n - n_0) \tag{5.6.2} $$

where $b_0$ and $n_0$ are constants. From (5.6.2), we have $h(n) * g(n) = b_0\,\delta(n - n_0)$, or equivalently 

$$ H(z) = b_0\,\frac{z^{-n_0}}{G(z)} \tag{5.6.3} $$

which provides the system function of the inverse system. The input can be recovered by convolving the output with 
the inverse system H(z). Therefore, the terms inverse filtering and deconvolution are equivalent for LTI systems. 
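
As a minimal illustration of (5.6.3) when the system is known, the sketch below assumes a hypothetical minimum-phase system $G(z) = 1 - 0.5z^{-1}$ with $b_0 = 1$ and $n_0 = 0$, so that the inverse $H(z) = 1/G(z)$ is a stable first-order recursion. The system, the seed, and the signal length are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)          # white input w(n), unknown to the receiver

# Hypothetical known system G(z) = 1 - 0.5 z^-1 (minimum-phase).
g = np.array([1.0, -0.5])
x = np.convolve(w, g)[:len(w)]         # observed output x(n) = w(n) - 0.5 w(n-1)

# Inverse system H(z) = 1 / G(z): w_hat(n) = x(n) + 0.5 * w_hat(n-1).
w_hat = np.zeros_like(x)
prev = 0.0
for n in range(len(x)):
    prev = x[n] + 0.5 * prev
    w_hat[n] = prev

print(np.max(np.abs(w_hat - w)))       # reconstruction error
```

For a non-minimum-phase G(z) the same recursion would be unstable, which is why a delay and a noncausal or approximate inverse are needed in that case.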
There are three approaches for blind deconvolution: 
• Identify the system G(z), design its inverse system H(z), and then compute the input w(n). 
• Identify directly the inverse H(z) = 1/G(z) of the system, and then determine the input w(n). 
• Estimate directly the input w(n) from the output x(n). 

Any of the above approaches requires either directly or indirectly the estimation of both the magnitude response 
$|G(e^{j\omega})|$ and the phase response $\angle G(e^{j\omega})$ of the unknown system. In practice, the problem becomes more 
complicated because the output x(n) is usually corrupted by additive noise. If this noise is uncorrelated with the 
input signal and the required second-order moments are available, we show how to design an optimum inverse filter 
that provides an optimum estimate of the input in the presence of noise. 

We now discuss the design of optimum inverse filters for linearly distorted signals observed in the presence of 
additive output noise. The typical configuration is shown in Figure 5.24. Ideally, we would like the optimum filter to 
restore the distorted signal x(n) to its original value y(n). However, the ability of the optimum filter to attain ideal
performance is limited by three factors. First, there is additive noise v(n) at the output of the system. Second, if the
physical system G(z) is causal, its output s(n) is delayed with respect to the input, and we may need some delay
$z^{-D}$ to improve the performance of the system. When G(z) is a non-minimum-phase system, the inverse system is
either noncausal or unstable and should be approximated by a causal and stable filter. Third, the inverse system may 
be IIR and should be approximated by an FIR filter. 

[Figure: inverse system modeling configuration with desired response y(n - D).]

FIGURE 5.24
Typical configuration for optimum inverse system modeling.

The optimum inverse filter is the noncausal Wiener filter

$$H_{\mathrm{nc}}(z) = \frac{z^{-D} R_{yx}(z)}{R_x(z)} \tag{5.6.4}$$

where the term $z^{-D}$ appears because the desired response is $y_D(n) \triangleq y(n-D)$. Since y(n) and v(n) are
uncorrelated, we have

$$R_{yx}(z) = R_{ys}(z) \tag{5.6.5}$$

and

$$R_x(z) = G(z)\,G^*\!\left(\frac{1}{z^*}\right) R_y(z) + R_v(z) \tag{5.6.6}$$

The cross-correlation between y(n) and s(n)

$$R_{ys}(z) = G^*\!\left(\frac{1}{z^*}\right) R_y(z) \tag{5.6.7}$$


is obtained by using Equation (5.5.18). Therefore, the optimum inverse filter is

$$H_{\mathrm{nc}}(z) = \frac{z^{-D}\, G^*(1/z^*)\, R_y(z)}{G(z)\,G^*(1/z^*)\,R_y(z) + R_v(z)} \tag{5.6.8}$$

which, in the absence of noise, becomes

$$H_{\mathrm{nc}}(z) = \frac{z^{-D}}{G(z)} \tag{5.6.9}$$

as expected. The behavior of the optimum inverse system is illustrated in the following example.
EXAMPLE 5.6.1 Let the system G(z) be an all-zero non-minimum-phase system given by

$$G(z) = \tfrac{1}{5}\left(-3z + 7 - 2z^{-1}\right) = -\tfrac{3}{5}\left(1 - \tfrac{1}{3}z^{-1}\right)(z - 2)$$

Then the inverse system is given by

$$H(z) = G^{-1}(z) = \frac{5}{-3z + 7 - 2z^{-1}} = \frac{1}{1 - \frac{1}{3}z^{-1}} - \frac{1}{1 - 2z^{-1}}$$

which is stable if the ROC is $\tfrac{1}{3} < |z| < 2$. Therefore, the impulse response of the inverse system is

$$h(n) = \begin{cases} \left(\tfrac{1}{3}\right)^{n} & n \ge 0 \\ 2^{n} & n < 0 \end{cases}$$


which is noncausal and stable.
Following the discussion given in this section, we want to design an optimum inverse system given that G(z) is driven by a
white noise sequence y(n) and that the additive noise v(n) is white, that is, $R_y(z) = \sigma_y^2$ and $R_v(z) = \sigma_v^2$. From (5.6.8), the
optimum noncausal inverse filter is given by

$$H_{\mathrm{nc}}(z) = \frac{z^{-D}}{G(z) + \dfrac{1}{G(z^{-1})}\,\dfrac{\sigma_v^2}{\sigma_y^2}}$$

which can be computed by assuming suitable values for variances $\sigma_y^2$ and $\sigma_v^2$. Note that if $\sigma_v^2 \ll \sigma_y^2$, that is, for very large
SNR, we obtain (5.6.9).
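
The high-SNR limit can be checked numerically. The following is a minimal sketch, not part of the original example: it evaluates (5.6.8) on the unit circle for the G(z) of this example with white y(n) and v(n), and measures how far the product of the inverse filter and G deviates from 1. The function names `G` and `H_nc`, the 64-point frequency grid, and the two noise variances are illustrative assumptions.

```python
import cmath
import math

def G(omega):
    """Frequency response of G(z) = (-3z + 7 - 2/z)/5 evaluated at z = e^{j omega}."""
    z = cmath.exp(1j * omega)
    return (-3 * z + 7 - 2 / z) / 5

def H_nc(omega, var_y, var_v, D=0):
    """Noncausal Wiener inverse filter (5.6.8) for white y(n) and v(n) (assumed)."""
    g = G(omega)
    # On |z| = 1, G*(1/z*) is the complex conjugate of G(e^{j omega}).
    num = cmath.exp(-1j * omega * D) * g.conjugate() * var_y
    den = abs(g) ** 2 * var_y + var_v
    return num / den

omegas = [2 * math.pi * k / 64 for k in range(64)]

# Largest deviation of H_nc * G from the ideal value 1 (delay D = 0).
dev_high_snr = max(abs(H_nc(w, 1.0, 1e-12) * G(w) - 1) for w in omegas)
dev_low_snr = max(abs(H_nc(w, 1.0, 0.1) * G(w) - 1) for w in omegas)
print(dev_high_snr, dev_low_snr)
```

With a negligible noise variance the product is essentially 1 at every frequency, which is the behavior stated in (5.6.9); with the larger noise variance the filter deliberately backs away from exact inversion where |G| is small.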


[Figure: plot of the MMSE versus delay D for D = 0, 1, ..., 9.]

FIGURE 5.25
The inverse filtering MMSE as a function of delay D.

FIGURE 5.26
Impulse responses of optimum inverse filters.


A more interesting case occurs when the optimum inverse filter is FIR, which can be easily implemented. To design this FIR
filter, we will need the autocorrelation $r_x(l)$ and the cross-correlation $r_{y_D x}(l)$, where $y_D(n) = y(n - D)$ is the delayed system input
sequence. Since

$$R_x(z) = \sigma_y^2\, G(z)\,G(z^{-1}) + \sigma_v^2 \qquad\text{and}\qquad R_{y_D x}(z) = \sigma_y^2\, z^{-D}\, G(z^{-1})$$

we have (see Section 2.2.1)

$$r_x(l) = g(l) * g(-l) * r_y(l) + r_v(l) = \sigma_y^2\,[g(l) * g(-l)] + \sigma_v^2\,\delta(l)$$

and

$$r_{y_D x}(l) = g(-l) * r_y(l - D) = \sigma_y^2\, g(-l + D)$$

respectively. Now we can determine the optimum FIR filter $\mathbf{h}_D$ of length M by constructing an M × M Toeplitz matrix R from
$r_x(l)$ and an M × 1 vector d from $r_{y_D x}(l)$ and then solving

$$\mathbf{R}\,\mathbf{h}_D = \mathbf{d}$$

for various values of D. We can then plot the MMSE as a function of D to determine the best value of D (and the corresponding
FIR filter) which will give the smallest MMSE. For example, if $\sigma_y^2 = 1$, $\sigma_v^2 = 0.1$, and M = 10, the correlation functions are

$$r_x(l) = \left\{\tfrac{6}{25},\; -\tfrac{7}{5},\; \tfrac{129}{50},\; -\tfrac{7}{5},\; \tfrac{6}{25}\right\} \quad\text{(center element at } l = 0\text{)}$$

and

$$r_{y_D x}(l) = \left\{-\tfrac{2}{5},\; \tfrac{7}{5},\; -\tfrac{3}{5}\right\} \quad\text{(center element at } l = D\text{)}$$

The resulting MMSE as a function of D is shown in Figure 5.25, which indicates that the best value of D is approximately M/2.
Finally, plots of impulse responses of the inverse system are shown in Figure 5.26. The first plot shows the noncausal h(n), the
second plot shows the causal FIR system $h_D(n)$ for D = 0, and the third plot shows the causal FIR system $h_D(n)$ for D = 5.
It is clear that the optimum delayed FIR inverse filter for $D \approx M/2$ closely matches the impulse response of the inverse filter
h(n).
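
The FIR design procedure of this example can be sketched in a few lines of code. The sketch below is illustrative rather than the book's implementation: it builds the Toeplitz matrix R and the vector d from the correlation formulas above for $\sigma_y^2 = 1$, $\sigma_v^2 = 0.1$, and M = 10, solves the normal equations for each delay D, and records the MMSE computed as $\sigma_y^2 - \mathbf{d}^T\mathbf{h}_D$ (real-valued data assumed). The helper names `solve`, `r_x`, and `design` are hypothetical, and a plain Gaussian elimination is used only to keep the example free of external libraries.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Impulse response of G(z) = (-3z + 7 - 2/z)/5, indexed by time n.
g = {-1: -3 / 5, 0: 7 / 5, 1: -2 / 5}
var_y, var_v, M = 1.0, 0.1, 10

def r_x(l):
    """r_x(l) = var_y * sum_k g(k) g(k - l) + var_v * delta(l)."""
    acf = sum(gk * g.get(k - l, 0.0) for k, gk in g.items())
    return var_y * acf + (var_v if l == 0 else 0.0)

# M x M Toeplitz correlation matrix.
R = [[r_x(i - j) for j in range(M)] for i in range(M)]

def design(D):
    """Return the length-M FIR inverse filter h_D and its MMSE for delay D."""
    d = [var_y * g.get(D - l, 0.0) for l in range(M)]  # r_{y_D x}(l), l = 0..M-1
    h = solve(R, d)
    mmse = var_y - sum(hl * dl for hl, dl in zip(h, d))
    return h, mmse

mmse = [design(D)[1] for D in range(M)]
best_D = min(range(M), key=lambda D: mmse[D])
print(best_D, [round(p, 4) for p in mmse])
```

Plotting `mmse` against D gives the kind of curve described for Figure 5.25; the delay lets the causal FIR filter capture part of the anticausal tail of the ideal inverse, which is why D = 0 performs worse than a delay near the middle of the filter.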


5.7 Summary 


In this chapter, we discussed the theory and application of optimum linear filters designed by minimizing the MSE 
criterion of performance. Our goal was to explain the characteristics of each criterion, emphasize when its use made 
sense, and illustrate its meaning in the context of practical applications. 

We started with linear processors that formed an estimate of the desired response by combining a set of different 
signals (data) and showed that the parameters of the optimum processor can be obtained by solving a linear system of 
equations (normal equations). The matrix and the right-hand side vector of the normal equations are completely 
specified by the second-order moments of the input data and the desired response. Next, we used the developed 
theory to design optimum FIR filters, linear signal estimators, and linear predictors. 

We emphasized the case of stationary stochastic processes and showed that the resulting optimum estimators are 
time-invariant. Therefore, we need to design only one optimum filter that can be used to process all realizations of the 
underlying stochastic processes. Although another filter may perform better for some realizations, that is, the 
estimated MSE is smaller than the MMSE, on average (i.e., when we consider all possible realizations), the optimum 
filter is the best. 

We showed that the performance of optimum linear filters improves as we increase the number of filter 
coefficients. Therefore, the noncausal IIR filter provides the best possible performance and can be used as a yardstick 
to assess other filters. Because IIR filters involve an infinite number of parameters, their design involves linear 
equations with an infinite number of unknowns. For stationary processes, these equations take the form of a 
convolution equation that can be solved using z-transform techniques. If we use a pole-zero structure, the normal
equations become nonlinear and the design of the optimum filter is complicated by the presence of multiple local 
minima. 

Then we discussed the design of optimum filters for inverse system modeling and blind deconvolution, and we 
provided a detailed discussion of their use in the important practical application of channel equalization for data
transmission systems.


Problems 


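5.1 Let x be a random vector with mean E{x}. Show that the linear MMSE estimate $\hat{y}$ of a random variable y using the data
vector x is given by $\hat{y} = y_0 + \mathbf{c}^H\mathbf{x}$, where $y_0 = E\{y\} - \mathbf{c}^H E\{\mathbf{x}\}$, $\mathbf{c} = \mathbf{R}^{-1}\mathbf{d}$, $\mathbf{R} = E\{\mathbf{x}\mathbf{x}^H\}$, and $\mathbf{d} = E\{\mathbf{x}y^*\}$.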


5.2 Consider an optimum FIR filter specified by the input correlation matrix R = Toeplitz{1, 1/4} and cross-correlation vector
$\mathbf{d} = [1 \;\; 1/2]^T$.
(a) Determine the optimum impulse response $\mathbf{c}_o$ and the MMSE $P_o$.
(b) Express $\mathbf{c}_o$ and $P_o$ in terms of the eigenvalues and eigenvectors of R.

5.3 Repeat Problem 5.2 for a third-order optimum FIR filter.

5.4 A process y(n) with the autocorrelation $r_y(l) = a^{|l|}$, $-1 < a < 1$, is corrupted by additive, uncorrelated white noise v(n) with
variance $\sigma_v^2$. To reduce the noise in the observed process x(n) = y(n) + v(n), we use a first-order Wiener filter.
(a) Express the coefficients $c_{o,1}$ and $c_{o,2}$ and the MMSE $P_o$ in terms of parameters a and $\sigma_v^2$.
(b) Compute and plot the PSD of x(n) and the magnitude response $|C_o(e^{j\omega})|$ of the filter when $\sigma_v^2 = 2$, for both a = 0.8
and a = -0.8, and compare the results.
(c) Compute and plot the processing gain of the filter for a = -0.9, -0.8, -0.7, ..., 0.9 as a function of a and comment on
the results.

5.5 Consider the harmonic process y(n) and its noise observation x(n) given in Example 5.3.1.
(a) Show that $r_y(l) = \tfrac{1}{2} A^2 \cos \omega_0 l$.
(b) Write a MATLAB function h = opt_fir(A, f0, var_v, M) to design an Mth-order optimum FIR filter impulse response
h(n). Use the toeplitz function from MATLAB to generate correlation matrix R.
(c) Determine the impulse response of a 20th-order optimum FIR filter for A = 0.5, $f_0 = 0.05$, and $\sigma_v^2 = 0.5$.
(d) Using MATLAB, determine and plot the magnitude response of the above-designed filter, and verify your results with those given
in Example 5.3.1.

5.6 Consider a "desired" signal s(n) generated by the process s(n) = -0.8w(n-1) + w(n), where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. This
signal is passed through the causal system $H(z) = 1 - 0.9z^{-1}$ whose output y(n) is corrupted by additive white noise
$v(n) \sim \mathrm{WN}(0, \sigma_v^2)$. The processes w(n) and v(n) are uncorrelated with $\sigma_w^2 = 0.3$ and $\sigma_v^2 = 0.1$.
(a) Design a second-order optimum FIR filter that estimates s(n) from the signal x(n) = y(n) + v(n) and determine $\mathbf{c}_o$ and
$P_o$.
(b) Plot the error performance surface, and verify that it is quadratic and that the optimum filter points to its minimum.
(c) Repeat part (a) for a third-order filter, and see whether there is any improvement.

5.7 Repeat Problem 5.6, assuming that the desired signal is generated by s(n) = -0.8s(n-1) + w(n).

5.8 Repeat Problem 5.6, assuming that H(z) = 1.

5.9 A stationary process x(n) is generated by the difference equation $x(n) = \rho x(n-1) + w(n)$, where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$.
(a) Show that the correlation matrix of x(n) is given by

$$\mathbf{R}_x = \frac{\sigma_w^2}{1-\rho^2}\,\mathrm{Toeplitz}\{1, \rho, \rho^2, \ldots, \rho^{M-1}\}$$

(b) Show that the Mth-order FLP is given by $a_1^{(M)} = -\rho$, $a_k^{(M)} = 0$ for $k > 1$ and the MMSE is $P_M^f = \sigma_w^2$.


5.10 Using Parseval's theorem, show that (5.3.18) can be written as (5.4.21) in the frequency domain.
5.11 By differentiating (5.3.21) with respect to $H(e^{j\omega})$, derive the frequency response function $H_o(e^{j\omega})$ of the optimum filter in
terms of $R_{yx}(e^{j\omega})$ and $R_x(e^{j\omega})$.

5.12 A conjugate symmetric linear smoother is obtained from (5.4.12) when M = 2L and i = L. If the process x(n) is stationary,
then, using $\mathbf{R}\mathbf{J} = \mathbf{J}\mathbf{R}^*$, show that $\mathbf{c} = \mathbf{J}\mathbf{c}^*$.

5.13 Let Q and $\boldsymbol{\Lambda}$ be the matrices from the eigendecomposition of R, that is, $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H$.
(a) Substitute R into (5.4.20) and (5.4.27) to prove (5.4.43) and (5.4.44).
(b) Generalize the above result for a jth-order linear signal estimator $\mathbf{c}^{(j)}(n)$; that is, prove that

$$\mathbf{c}^{(j)}(n) = P_o^{(j)}(n) \sum_{i=1}^{M+1} \frac{1}{\lambda_i}\, \mathbf{q}_i\, q_{i,j}^{*}$$

5.14 Let $\tilde{\mathbf{R}}(n)$ be the inverse of the correlation matrix R(n) given in (5.4.11).
(a) Using (5.4.12), show that the diagonal elements of $\tilde{\mathbf{R}}(n)$ are given by

$$\langle \tilde{\mathbf{R}}(n) \rangle_{i,i} = \frac{1}{P^{(i)}(n)} \qquad 1 \le i \le M+1$$

(b) Furthermore, show that

$$\mathbf{c}^{(i)}(n) = \frac{\tilde{\mathbf{r}}_i(n)}{\langle \tilde{\mathbf{R}}(n) \rangle_{i,i}} \qquad 1 \le i \le M+1$$

where $\tilde{\mathbf{r}}_i(n)$ is the i-th column of $\tilde{\mathbf{R}}(n)$.

5.15 The first five samples of the autocorrelation sequence of a signal x(n) are r(0) = 1, r(1) = 0.8, r(2) = 0.6, r(3) = 0.4, and r(4) = 0.3.
Compute the FLP, the BLP, the optimum symmetric smoother, and the corresponding MMSE (a) by using the normal equations
method and (b) by using the inverse of the normal equations matrix.

5.16 For the symmetric, Toeplitz autocorrelation matrix $\mathbf{R} = \mathrm{Toeplitz}\{r(0), r(1), r(2)\} = r(0) \times \mathrm{Toeplitz}\{1, \rho_1, \rho_2\}$ with
$\mathbf{R} = \mathbf{L}\mathbf{D}\mathbf{L}^H$ and $\mathbf{D} = \mathrm{diag}\{\xi_1, \xi_2, \xi_3\}$, the following conditions are equivalent:
* R is positive definite.
* $\xi_i > 0$ for $1 \le i \le 3$.
* $|k_i| < 1$ for $1 \le i \le 3$.
Determine the values of $\rho_1$ and $\rho_2$ for which R is positive definite, and plot the corresponding area in the $(\rho_1, \rho_2)$ plane.

5.17 Prove the first equation in (5.4.45) by rearranging the FLP normal equations in terms of the unknowns
$P_o^f(n), a_1(n), \ldots, a_M(n)$ and then solve for $P_o^f(n)$, using Cramer's rule. Repeat the procedure for the second equation.

5.18 Consider the signal x(n) = y(n) + v(n), where y(n) is a useful random signal corrupted by noise v(n). The processes y(n)
and v(n) are uncorrelated with PSDs


Nia 


and R, (e!”) = 


© sla 


0 


respectively. (a) Determine the optimum IIR filter and find the MMSE. (b) Determine a third-order optimum FIR filter and the
corresponding MMSE. (c) Determine the noncausal optimum FIR filter defined by

$$\hat{y}(n) = h(-1)x(n+1) + h(0)x(n) + h(1)x(n-1)$$

5.19 Consider the ARMA(1,1) process x(n) = 0.8x(n-1) + w(n) + 0.5w(n-1), where $w(n) \sim \mathrm{WGN}(0, 1)$. (a) Determine the
coefficients and the MMSE of (1) the one-step ahead FLP $\hat{x}(n) = a_1 x(n-1) + a_2 x(n-2)$ and (2) the two-step ahead FLP
$\hat{x}(n+1) = a_1 x(n-1) + a_2 x(n-2)$. (b) Check if the obtained prediction error filters are minimum-phase, and explain your
findings.

5.20 Consider a random signal x(n) = s(n) + v(n), where $v(n) \sim \mathrm{WGN}(0, 1)$ and s(n) is the AR(1) process
s(n) = 0.9s(n-1) + w(n), where $w(n) \sim \mathrm{WGN}(0, 0.64)$. The signals s(n) and v(n) are uncorrelated. (a) Determine and
plot the autocorrelation $r_s(l)$ and the PSD $R_s(e^{j\omega})$ of s(n). (b) Design a second-order optimum FIR filter to estimate s(n)
from x(n). What is the MMSE? (c) Design an optimum IIR filter to estimate s(n) from x(n). What is the MMSE?

5.21 A useful signal s(n) with PSD $R_s(z) = [(1 - 0.9z^{-1})(1 - 0.9z)]^{-1}$ is corrupted by additive uncorrelated noise
$v(n) \sim \mathrm{WN}(0, \sigma_v^2)$. (a) The resulting signal x(n) = s(n) + v(n) is passed through a causal filter with system function
$H(z) = (1 - 0.8z^{-1})^{-1}$. Determine (1) the SNR at the input, (2) the SNR at the output, and (3) the processing gain, that is, the
improvement in SNR. (b) Determine the causal optimum filter and compare its performance with that of the filter in (a).

5.22 A useful signal s(n) with PSD $R_s(z) = 0.36\,[(1 - 0.8z^{-1})(1 - 0.8z)]^{-1}$ is corrupted by additive uncorrelated noise
$v(n) \sim \mathrm{WN}(0, 1)$. Determine the optimum noncausal and causal IIR filters, and compare their performance by examining the
MMSE and their magnitude response. Hint: Plot the magnitude responses on the same graph with the PSDs of signal and noise.

5.23 Consider a process with PSD $R_x(z) = \sigma^2 H_x(z) H_x(z^{-1})$. Determine the D-step ahead linear predictor, and show that the MMSE
is given by $P^{(D)} = \sigma^2 \sum_{n=0}^{D-1} h_x^2(n)$. Check your results by using the PSD $R_x(z) = (1 - a^2)\,[(1 - az^{-1})(1 - az)]^{-1}$.

5.24 Let x(n) = s(n) + v(n) with $R_v(z) = 1$, $R_{sv}(z) = 0$, and

$$R_s(z) = \frac{0.75}{(1 - 0.5z^{-1})(1 - 0.5z)}$$

Determine the optimum filters for the estimation of s(n) and s(n-2) from $\{x(k)\}_{-\infty}^{n}$ and the corresponding MMSEs.
5.25 For the random signal with PSD

$$R_x(z) = \frac{(1 - 0.2z^{-1})(1 - 0.2z)}{(1 - 0.9z^{-1})(1 - 0.9z)}$$

determine the optimum two-step ahead linear predictor and the corresponding MMSE.
5.26 Repeat Problem 5.25 for

$$R_x(z) = \frac{1}{(1 - 0.2z^{-1})(1 - 0.2z)(1 - 0.9z^{-1})(1 - 0.9z)}$$

5.27 Let x(n) = s(n) + v(n) with $v(n) \sim \mathrm{WN}(0, 1)$ and s(n) = 0.6s(n-1) + w(n), where $w(n) \sim \mathrm{WN}(0, 0.82)$. The processes
s(n) and v(n) are uncorrelated. Determine the optimum filters for the estimation of s(n), s(n+2), and s(n-2) from
$\{x(k)\}_{-\infty}^{n}$ and the corresponding MMSEs.

5.28 Repeat Problem 5.27 for $R_s(z) = [(1 - 0.5z^{-1})(1 - 0.5z)]^{-1}$, $R_v(z) = 5$, and $R_{sv}(z) = 0$.

5.29 Consider the random sequence x(n) generated in Example 5.4.2


x(n) = w(n) + mn —-1) 


where w(n) is WN(0,1). Generate K sample functions $\{w_k(n)\}_{n=0}^{N}$, $k = 1, \ldots, K$ of w(n), in order to generate K
sample functions $\{x_k(n)\}_{n=0}^{N}$, $k = 1, \ldots, K$ of x(n).

(a) Use the second-order FLP $\mathbf{a}_o$ to obtain predictions $\{\hat{x}_k^f(n)\}_{n=2}^{N}$ of $x_k(n)$, for $k = 1, \ldots, K$. Then
determine the average error

$$\hat{P}_k^f = \frac{1}{N-1} \sum_{n=2}^{N} |x_k(n) - \hat{x}_k^f(n)|^2 \qquad k = 1, \ldots, K$$

and plot it as a function of k. Compare it with $P_o^f$.
(b) Use the second-order BLP $\mathbf{b}_o$ to obtain predictions $\{\hat{x}_k^b(n)\}_{n=0}^{N-2}$, $k = 1, \ldots, K$ of $x_k(n)$. Then determine
the average error

$$\hat{P}_k^b = \frac{1}{N-1} \sum_{n=0}^{N-2} |x_k(n) - \hat{x}_k^b(n)|^2 \qquad k = 1, \ldots, K$$

and plot it as a function of k. Compare it with $P_o^b$.
(c) Use the second-order symmetric linear smoother $\mathbf{c}_o$ to obtain smooth estimates $\{\hat{x}_k^c(n)\}_{n=1}^{N-1}$ of $x_k(n)$ for $k = 1, \ldots, K$.
Determine the average error

$$\hat{P}_k^s = \frac{1}{N-1} \sum_{n=1}^{N-1} |x_k(n) - \hat{x}_k^c(n)|^2 \qquad k = 1, \ldots, K$$

and plot it as a function of k. Compare it with $P_o^s$.
5.30 Let x(n) = y(n) + v(n) be a wide-sense stationary process. The linear, symmetric smoothing filter estimator of y(n) is given
by

$$\hat{y}(n) = \sum_{k=-L}^{L} h(k)\, x(n-k)$$

(a) Determine the normal equations for the optimum MMSE filter.
(b) Show that the smoothing filter $\mathbf{c}_o^s$ has linear phase.
(c) Use the Lagrange multiplier method to determine the MMSE Mth-order estimator $\hat{y}(n) = \mathbf{c}^H\mathbf{x}(n)$, where M = 2L + 1,
when the filter vector c is constrained to be conjugate symmetric, that is, $\mathbf{c} = \mathbf{J}\mathbf{c}^*$. Compare the results with those obtained in
part (a).
5.31 Consider the causal prediction filter discussed in Example 5.5.1. To determine $H_c^{[D]}(z)$, first compute the causal part of the
z-transform $[R'_{yw}(z)]_+$. Next compute $H_c^{[D]}(z)$ by using (5.5.21).
(a) Determine $h_c^{[D]}(n)$.

(b) Using the above $h_c^{[D]}(n)$, show that


54 
pl?) =1—-2 (2) 
c 3 ( z 
5.32 Consider the causal smoothing filter discussed in Example 5.5.1. 
(a) Using [Ho(l)]+ = ro(l+ D)u(l), D <0, show that [7,(/)]. can be put in the form 


KDL a(S)" u(l+ D)+=(2!?ylu(l)—u(l +D) D<0 


(b) Hence, show that [R;o(z)]+ is given by 





; 3 z” 3 D 41-1 
R_,(z)], == S8 2'z 
[R,(Z)] EE zí D> 
5 


(c) Finally using (5.5.21), prove (5.5.54). 
5.33 In this problem, we will prove (5.5.57) 


(a) Starting with (5.5.42), show that [Ryo(z)]+ can also be put in the form 





i 3 z 

R = —— 

[Ryo(Z)], 3] | a ae 
= 


(b) Now, using (5.5.21), show that 


3 2-47) 4324 
H= — > ___3 
d- z')-2z") 


hence, show that 


D 
im H(z) => | — 2 _] 272) 
G——7 1-22") 
5 4 


(c) Finally, show that lim PP! = Pe- 


Deco 
5.34 Consider the block diagram of a simple communication system shown in Figure 5.27. The information resides in the signal s(n)
produced by exciting the system $H_1(z) = 1/(1 + 0.95z^{-1})$ with the process $w(n) \sim \mathrm{WGN}(0, 0.3)$. The signal s(n)
propagates through the channel $H_2(z) = 1/(1 - 0.85z^{-1})$, and is corrupted by the additive noise process $v(n) \sim \mathrm{WGN}(0, 0.1)$,
which is uncorrelated with w(n). (a) Determine a second-order optimum FIR filter (M = 2) that estimates the signal s(n)
from the received signal x(n) = z(n) + v(n). What is the corresponding MMSE $P_o$? (b) Plot the error performance surface and
verify that the optimum filter corresponds to the bottom of the "bowl." (c) Use a Monte Carlo simulation (100 realizations with a
1000-sample length each) to verify the theoretically obtained MMSE in part (a). (d) Repeat part (a) for M = 3 and check if
there is any improvement. Hint: To compute the autocorrelation of z(n), notice that the output of $H_1(z)H_2(z)$ is an AR(2)
process.

[Figure: block diagram with input w(n), an optimum filter, and error output e(n).]

FIGURE 5.27
Block diagram of simple communication system used in Problem 5.34.


5.35 Determine the matched filter for the deterministic pulse $s(n) = \cos \omega_0 n$ for $0 \le n \le M-1$ and zero elsewhere when the
noise is (a) white with variance $\sigma_v^2$ and (b) colored with autocorrelation $r_v(l) = \sigma_v^2 \rho^{|l|}/(1 - \rho^2)$, $-1 < \rho < 1$. Plot the
frequency response of the filter and superimpose it on the noise PSD, for $\omega_0 = \pi/6$, M = 12, $\sigma_v^2 = 1$, and $\rho = 0.9$. Explain
the shape of the obtained response. (c) Study the effect of the SNR in part (a) by varying the value of $\sigma_v^2$. (d) Study the effect of
the noise correlation in part (b) by varying the value of $\rho$.

5.36 In this problem we formulate the design of optimum linear signal estimators (LSE) using a constrained optimization framework.
To this end we consider the estimator $e(n) = c_0^* x(n) + \cdots + c_M^* x(n-M) = \mathbf{c}^H\mathbf{x}(n)$ and we wish to minimize the output power
$E\{|e(n)|^2\} = \mathbf{c}^H\mathbf{R}\mathbf{c}$. To prevent the trivial solution c = 0 we need to impose some constraint on the filter coefficients and use
Lagrange multipliers to determine the minimum. Let $\mathbf{u}_i$ be an M × 1 vector with one at the ith position and zeros elsewhere. (a)
Show that minimizing $\mathbf{c}^H\mathbf{R}\mathbf{c}$ under the linear constraint $\mathbf{u}_i^T\mathbf{c} = 1$ provides the following estimators: FLP if i = 0, BLP if
i = M, and linear smoother if $i \ne 0, M$. (b) Determine the appropriate set of constraints for the L-steps ahead linear predictor,
defined by $c_0 = 1$ and $\{c_k = 0\}_{1}^{L-1}$, and solve the corresponding constrained optimization problem. Verify your answer by
obtaining the normal equations using the orthogonality principle. (c) Determine the optimum linear estimator by minimizing
$\mathbf{c}^H\mathbf{R}\mathbf{c}$ under the quadratic constraints $\mathbf{c}^H\mathbf{c} = 1$ and $\mathbf{c}^H\mathbf{W}\mathbf{c} = 1$ (W is a positive definite matrix) which impose a constraint on
the length of the filter vector.


CHAPTER 6


Algorithms and Structures
for Optimum Linear Filters

The design and application of optimum filters involves (1) the solution of the normal equations to determine the 
optimum set of coefficients, (2) the evaluation of the cost function to determine whether the obtained parameters 
satisfy the design requirements, and (3) the implementation of the optimum filter, that is, the computation of its 
output that provides the estimate of the desired response. 

The normal equations can be solved by using any general-purpose routine for linear simultaneous equations. 
However, there are several important reasons to study the normal equations in greater detail in order to develop 
efficient, special-purpose algorithms for their solution. First, the throughput of several real-time applications can only 
be served with serial or parallel algorithms that are obtained by exploiting the special structure (e.g., Toeplitz) of the 
correlation matrix. Second, sometimes we can develop order-recursive algorithms that help us to choose the correct 
filter order or to stop the algorithm before the manifestation of numerical problems. Third, some algorithms lead to 
intermediate sets of parameters that have physical meaning, provide easy tests for important properties (e.g., 
minimum phase), or are useful in special applications (e.g., data compression). Finally, sometimes there is a link 
between the algorithm for the solution of the normal equations and the structure for the implementation of the 
optimum filter. 

In this chapter, we present different algorithms for the solution of the normal equations, the computation of the 
minimum mean square error (MMSE), and the implementation of the optimum filter. We start in Section 6.1 with a 
discussion of some results from matrix algebra that are useful for the development of order-recursive algorithms and 
introduce an algorithm for the order-recursive computation of the $LDL^H$ decomposition, the MMSE, and the optimum
estimate in the general case. 

The only assumption we have made so far is that we know the required second-order statistics; hence, the results 
apply to any linear estimation problem: array processing, filtering, and prediction of nonstationary or stationary 
processes. In the sequel, we impose additional constraints on the input data vector and show how to exploit them in 
order to simplify the general algorithms and structures or specify new ones. In Section 6.3, we explore the shift 
invariance of the input data vector to develop a time-varying lattice-ladder structure for the optimum filter. However, 
to derive an order-recursive algorithm for the computation of either the direct or lattice-ladder structure parameters of 
the optimum time-varying filter, we need an analytical description of the changing second-order statistics of the 
nonstationary input process. Recall that in the simplest case of stationary processes, the correlation matrix is constant 
and Toeplitz. As a result, the optimum FIR filters and predictors are time-invariant, and their direct or lattice-ladder 
structure parameters can be computed (only once) using efficient, order-recursive algorithms due to Levinson and 
Durbin (Section 6.4). Section 6.5 provides a derivation of the lattice-ladder structures for optimum filtering and 
prediction, their structural and statistical properties, and algorithms for transformations between the various sets of 
parameters. 


6.1 Fundamentals of Order-Recursive Algorithms 


The optimum estimate is computed as a sum of products using a linear combiner supplied with the optimum 
coefficients and the input data. The key characteristic of this approach is that the order of the estimator should be 
fixed initially, and in case we choose a different order, we have to repeat all the computations. Such computational 
methods are known as fixed-order algorithms.
When the order of the estimator becomes a design variable, we need to modify our notation to take this into
account. For example, the mth-order estimator $\mathbf{c}_m(n)$ is obtained by minimizing $E\{|e_m(n)|^2\}$, where

$$e_m(n) = y(n) - \hat{y}_m(n) \tag{6.1.1}$$
$$\hat{y}_m(n) = \mathbf{c}_m^H(n)\,\mathbf{x}_m(n) \tag{6.1.2}$$
$$\mathbf{c}_m(n) \triangleq [c_1^{(m)}(n) \;\; c_2^{(m)}(n) \;\; \cdots \;\; c_m^{(m)}(n)]^T \tag{6.1.3}$$
$$\mathbf{x}_m(n) \triangleq [x_1(n) \;\; x_2(n) \;\; \cdots \;\; x_m(n)]^T \tag{6.1.4}$$


In general, we use the subscript m to denote the order of a matrix or vector and the superscript m to emphasize that a
scalar is a component of an m × 1 vector. We note that these quantities are functions of time n, but sometimes we do
not explicitly show this dependence for the sake of simplicity.

If the mth-order estimator $\mathbf{c}_m(n)$ has been computed by solving the normal equations, it seems to be a waste of
computational power to start from scratch to compute the (m+1)st-order estimator $\mathbf{c}_{m+1}(n)$. Thus, we would like to
arrange the computations so that the results for order m, that is, $\mathbf{c}_m(n)$ or $\hat{y}_m(n)$, can be used to compute the
estimates for order m + 1, that is, $\mathbf{c}_{m+1}(n)$ or $\hat{y}_{m+1}(n)$. The resulting procedures are called order-recursive algorithms
or order-updating relations. Similarly, procedures that compute $\mathbf{c}_m(n+1)$ from $\mathbf{c}_m(n)$ or $\hat{y}_m(n+1)$ from $\hat{y}_m(n)$
are called time-recursive algorithms or time-updating relations. Combined order and time updates are also possible. All
these updates play a central role in the design and implementation of many optimum and adaptive filters.

In this section, we derive order-recursive algorithms for the computation of the $LDL^H$ decomposition, the
MMSE, and the MMSE optimal estimate. We also show that there is no order-recursive algorithm for the
computation of the estimator parameters. 


6.1.1 Matrix Partitioning and Optimum Nesting 


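We start by introducing some notation that is useful for the discussion of order-recursive algorithms.¹ Notice that if
the order of the estimator increases from m to m + 1, then the input data vector is augmented with one additional
observation $x_{m+1}$. We use the notation $\mathbf{x}_{m+1}^{\lceil m \rceil}$ to denote the vector that consists of the first m components and $\mathbf{x}_{m+1}^{\lfloor m \rfloor}$
for the last m components of vector $\mathbf{x}_{m+1}$. The same notation can be generalized to matrices. The m × m matrix
$\mathbf{R}_{m+1}^{\lceil m \rceil}$, obtained by the intersection of the first m rows and columns of $\mathbf{R}_{m+1}$, is known as the mth-order leading
principal submatrix of $\mathbf{R}_{m+1}$. In other words, if $r_{ij}$ are the elements of $\mathbf{R}_{m+1}$, then the elements of $\mathbf{R}_{m+1}^{\lceil m \rceil}$ are
$r_{ij}$, $1 \le i, j \le m$. Similarly, $\mathbf{R}_{m+1}^{\lfloor m \rfloor}$ denotes the matrix obtained by the intersection of the last m rows and columns of
$\mathbf{R}_{m+1}$. For example, if m = 3 we obtain

$$\mathbf{R}_4 = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} \\ r_{21} & r_{22} & r_{23} & r_{24} \\ r_{31} & r_{32} & r_{33} & r_{34} \\ r_{41} & r_{42} & r_{43} & r_{44} \end{bmatrix} \tag{6.1.5}$$

where the upper left 3 × 3 block is $\mathbf{R}_4^{\lceil 3 \rceil}$ and the lower right 3 × 3 block is $\mathbf{R}_4^{\lfloor 3 \rfloor}$,
which illustrates the upper left corner and lower right corner partitionings of matrix $\mathbf{R}_4$.
Since $\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m$, we can easily see that the correlation matrix can be partitioned as

$$\mathbf{R}_{m+1} = E\left\{ \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_m^H & x_{m+1}^* \end{bmatrix} \right\} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \tag{6.1.6}$$

where

$$\mathbf{r}_m^b \triangleq E\{\mathbf{x}_m x_{m+1}^*\} \tag{6.1.7}$$

and

$$\rho_m^b \triangleq E\{|x_{m+1}|^2\} \tag{6.1.8}$$

The result

$$\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m \;\Rightarrow\; \mathbf{R}_m = \mathbf{R}_{m+1}^{\lceil m \rceil} \tag{6.1.9}$$

is known as the optimum nesting property and is instrumental in the development of order-recursive algorithms.

¹ All quantities in Sections 6.1 and 6.2 are functions of the time index n. However, for notational simplicity we do not explicitly show this
dependence.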
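Similarly, we can show that $\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m$ implies

$$\mathbf{d}_{m+1} = E\{\mathbf{x}_{m+1}\, y^*\} = E\left\{ \begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} y^* \right\} = \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \tag{6.1.10}$$

or

$$\mathbf{x}_{m+1}^{\lceil m \rceil} = \mathbf{x}_m \;\Rightarrow\; \mathbf{d}_m = \mathbf{d}_{m+1}^{\lceil m \rceil} \tag{6.1.11}$$

that is, the right-hand side of the normal equations also has the optimum nesting property.
Since (6.1.9) and (6.1.11) hold for all $1 \le m \le M$, the correlation matrix $\mathbf{R}_M$ and the cross-correlation
vector $\mathbf{d}_M$ contain the information for the computation of all the optimum estimators $\mathbf{c}_m$ for $1 \le m \le M$.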


6.1.2 Inversion of Partitioned Hermitian Matrices 


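Suppose now that we know the inverse $\mathbf{R}_m^{-1}$ of the leading principal submatrix $\mathbf{R}_{m+1}^{\lceil m \rceil} = \mathbf{R}_m$ of matrix $\mathbf{R}_{m+1}$, and
we wish to use it to compute $\mathbf{R}_{m+1}^{-1}$ without having to repeat all the work. Since the inverse $\mathbf{Q}_{m+1}$ of the Hermitian
matrix $\mathbf{R}_{m+1}$ is also Hermitian, it can be partitioned as

$$\mathbf{Q}_{m+1} = \begin{bmatrix} \mathbf{Q}_m & \mathbf{q}_m \\ \mathbf{q}_m^H & q_m \end{bmatrix} \tag{6.1.12}$$

Using (6.1.6), we obtain

$$\mathbf{R}_{m+1}\mathbf{Q}_{m+1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix} \begin{bmatrix} \mathbf{Q}_m & \mathbf{q}_m \\ \mathbf{q}_m^H & q_m \end{bmatrix} = \begin{bmatrix} \mathbf{I}_m & \mathbf{0}_m \\ \mathbf{0}_m^H & 1 \end{bmatrix} \tag{6.1.13}$$

After performing the matrix multiplication, we get

$$\mathbf{R}_m\mathbf{Q}_m + \mathbf{r}_m^b\mathbf{q}_m^H = \mathbf{I}_m \tag{6.1.14}$$
$$\mathbf{r}_m^{bH}\mathbf{Q}_m + \rho_m^b\mathbf{q}_m^H = \mathbf{0}_m^H \tag{6.1.15}$$
$$\mathbf{R}_m\mathbf{q}_m + \mathbf{r}_m^b q_m = \mathbf{0}_m \tag{6.1.16}$$
$$\mathbf{r}_m^{bH}\mathbf{q}_m + \rho_m^b q_m = 1 \tag{6.1.17}$$

where $\mathbf{0}_m$ is the m × 1 zero vector. If matrix $\mathbf{R}_m$ is invertible, we can solve (6.1.16) for $\mathbf{q}_m$

$$\mathbf{q}_m = -\mathbf{R}_m^{-1}\mathbf{r}_m^b\, q_m \tag{6.1.18}$$

and then substitute into (6.1.17) to obtain $q_m$ as

$$q_m = \frac{1}{\rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b} \tag{6.1.19}$$

assuming that the scalar quantity $\rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b \ne 0$. Substituting (6.1.19) into (6.1.18), we obtain

$$\mathbf{q}_m = \frac{-\mathbf{R}_m^{-1}\mathbf{r}_m^b}{\rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b} \tag{6.1.20}$$

which, in conjunction with (6.1.14), yields

$$\mathbf{Q}_m = \mathbf{R}_m^{-1} - \mathbf{R}_m^{-1}\mathbf{r}_m^b\mathbf{q}_m^H = \mathbf{R}_m^{-1} + \frac{\mathbf{R}_m^{-1}\mathbf{r}_m^b\,(\mathbf{R}_m^{-1}\mathbf{r}_m^b)^H}{\rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b} \tag{6.1.21}$$
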
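We note that (6.1.19) through (6.1.21) express the parts of the inverse matrix $\mathbf{Q}_{m+1}$ in terms of known quantities. For
our purposes, we express the above equations in a more convenient form, using the quantities

$$\mathbf{b}_m \triangleq [b_0^{(m)} \;\; b_1^{(m)} \;\; \cdots \;\; b_{m-1}^{(m)}]^T \triangleq -\mathbf{R}_m^{-1}\mathbf{r}_m^b \tag{6.1.22}$$

and

$$\alpha_m^b \triangleq \rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b = \rho_m^b + \mathbf{r}_m^{bH}\mathbf{b}_m \tag{6.1.23}$$

Thus, if matrix $\mathbf{R}_m$ is invertible and $\alpha_m^b \ne 0$, combining (6.1.13) with (6.1.19) through (6.1.23), we obtain

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^b \\ \mathbf{r}_m^{bH} & \rho_m^b \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} + \frac{1}{\alpha_m^b} \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix} \tag{6.1.24}$$

which determines $\mathbf{R}_{m+1}^{-1}$ from $\mathbf{R}_m^{-1}$ by using a simple rank-one modification known as the matrix inversion by
partitioning lemma (Noble and Daniel 1988).
Another useful expression for $\alpha_m^b$ is

$$\alpha_m^b = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} \tag{6.1.25}$$

which reinforces the importance of the quantity $\alpha_m^b$ for the invertibility of matrix $\mathbf{R}_{m+1}$ (see Problem 6.1).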
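EXAMPLE 6.1.1 Given the matrix

$$\mathbf{R}_3 = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 1 \end{bmatrix}$$

and the inverse matrix

$$\mathbf{R}_2^{-1} = \begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{bmatrix}^{-1} = \frac{1}{3} \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix}$$

compute matrix $\mathbf{R}_3^{-1}$, using the matrix inversion by partitioning lemma.
Solution. To determine $\mathbf{R}_3^{-1}$ from the order-updating formula (6.1.24), we first compute

$$\mathbf{b}_2 = -\mathbf{R}_2^{-1}\mathbf{r}_2^b = -\frac{1}{3} \begin{bmatrix} 4 & -2 \\ -2 & 4 \end{bmatrix} \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{1}{2} \end{bmatrix} = -\frac{1}{9} \begin{bmatrix} 1 \\ 4 \end{bmatrix}$$

and

$$\alpha_2^b = \rho_2^b + \mathbf{r}_2^{bH}\mathbf{b}_2 = 1 - \frac{1}{9} \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{2} \end{bmatrix} \begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{20}{27}$$

using (6.1.22) and (6.1.23). Then we compute

$$\mathbf{R}_3^{-1} = \frac{1}{3} \begin{bmatrix} 4 & -2 & 0 \\ -2 & 4 & 0 \\ 0 & 0 & 0 \end{bmatrix} + \frac{27}{20} \begin{bmatrix} -\tfrac{1}{9} \\ -\tfrac{4}{9} \\ 1 \end{bmatrix} \begin{bmatrix} -\tfrac{1}{9} & -\tfrac{4}{9} & 1 \end{bmatrix} = \frac{1}{20} \begin{bmatrix} 27 & -12 & -3 \\ -12 & 32 & -12 \\ -3 & -12 & 27 \end{bmatrix}$$

using (6.1.24). The reader can easily verify the above calculations using MATLAB.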
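
As an alternative to a MATLAB check, the order update (6.1.24) can be sketched with exact rational arithmetic. The sketch below assumes real-valued, symmetric data (so no complex conjugation is needed) and uses a hypothetical helper name, `inverse_update`; it applies (6.1.22) through (6.1.24) to the 3 × 3 matrix with rows (1, 1/2, 1/3), (1/2, 1, 1/2), (1/3, 1/2, 1) and confirms that the result is a true inverse.

```python
from fractions import Fraction as F

def inverse_update(R_inv, r_b, rho_b):
    """Order update (6.1.24) for real symmetric data (an assumption of this sketch).

    Returns (R_{m+1}^{-1}, b_m, alpha_m^b) given R_m^{-1}, r_m^b and rho_m^b.
    """
    m = len(r_b)
    # b_m = -R_m^{-1} r_m^b                      (6.1.22)
    b = [-sum(R_inv[i][j] * r_b[j] for j in range(m)) for i in range(m)]
    # alpha_m^b = rho_m^b + r_m^{bH} b_m         (6.1.23)
    alpha = rho_b + sum(r_b[i] * b[i] for i in range(m))
    if alpha == 0:
        raise ValueError("alpha_m^b = 0, so R_{m+1} is not invertible")
    v = b + [F(1)]  # the vector [b_m; 1]
    # Zero-padded R_m^{-1} plus the rank-one term (1/alpha) v v^T.
    Q = [[(R_inv[i][j] if i < m and j < m else F(0)) + v[i] * v[j] / alpha
          for j in range(m + 1)] for i in range(m + 1)]
    return Q, b, alpha

R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
R2_inv = [[F(4, 3), F(-2, 3)], [F(-2, 3), F(4, 3)]]

R3_inv, b2, alpha2 = inverse_update(R2_inv, [F(1, 3), F(1, 2)], F(1))

# Check that R3 * R3_inv is the identity matrix.
product = [[sum(R3[i][k] * R3_inv[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
print(b2, alpha2)
print(R3_inv)
```

Using `Fraction` keeps every intermediate quantity exact, so the update can be compared entry by entry with a hand calculation rather than up to rounding error.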
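Following a similar approach, we can show (see Problem 6.2) that the inverse of the lower right corner
partitioned matrix $\mathbf{R}_{m+1}$ can be expressed as

$$\mathbf{R}_{m+1}^{-1} = \begin{bmatrix} \rho_m^f & \mathbf{r}_m^{fH} \\ \mathbf{r}_m^f & \mathbf{R}_m^f \end{bmatrix}^{-1} = \begin{bmatrix} 0 & \mathbf{0}_m^H \\ \mathbf{0}_m & (\mathbf{R}_m^f)^{-1} \end{bmatrix} + \frac{1}{\alpha_m^f} \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} \begin{bmatrix} 1 & \mathbf{a}_m^H \end{bmatrix} \tag{6.1.26}$$

where

$$\mathbf{a}_m \triangleq [a_1^{(m)} \;\; a_2^{(m)} \;\; \cdots \;\; a_m^{(m)}]^T \triangleq -(\mathbf{R}_m^f)^{-1}\mathbf{r}_m^f \tag{6.1.27}$$

$$\alpha_m^f \triangleq \rho_m^f - \mathbf{r}_m^{fH}(\mathbf{R}_m^f)^{-1}\mathbf{r}_m^f = \rho_m^f + \mathbf{r}_m^{fH}\mathbf{a}_m = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m^f} \tag{6.1.28}$$

and the relationship (6.1.26) exists if matrix $\mathbf{R}_m^f$ is invertible and $\alpha_m^f \ne 0$. A similar set of formulas can be
obtained for arbitrary matrices (see Problem 6.3).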

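Interpretations. The vector $\mathbf{b}_m$, defined by (6.1.22), is the MMSE estimator of observation $x_{m+1}$ from data
vector $\mathbf{x}_m$. Indeed, if

$$e_m^b = x_{m+1} - \hat{x}_{m+1} = x_{m+1} + \mathbf{b}_m^H\mathbf{x}_m \tag{6.1.29}$$

we can show, using the orthogonality principle $E\{\mathbf{x}_m e_m^{b*}\} = \mathbf{0}$, that $\mathbf{b}_m$ results in the MMSE given by

$$P_m^b = \rho_m^b + \mathbf{b}_m^H\mathbf{r}_m^b = \alpha_m^b \tag{6.1.30}$$

Similarly, we can show that $\mathbf{a}_m$, defined by (6.1.27), is the optimum estimator of $x_1$ based on
$\tilde{\mathbf{x}}_m \triangleq [x_2 \;\; x_3 \;\; \cdots \;\; x_{m+1}]^T$. By using the orthogonality principle, $E\{\tilde{\mathbf{x}}_m e_m^{f*}\} = \mathbf{0}$, the MMSE is

$$P_m^f = \rho_m^f + \mathbf{r}_m^{fH}\mathbf{a}_m = \alpha_m^f \tag{6.1.31}$$

If $\mathbf{x}_{m+1} = [x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-m)]^T$, then $\mathbf{b}_m$ provides the backward linear predictor (BLP) and $\mathbf{a}_m$ the
forward linear predictor (FLP) of the process x(n) from Section 5.5. For convenience, we always use this
terminology even if, strictly speaking, the linear prediction interpretation is not applicable.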


6.1.3 Levinson Recursion for the Optimum Estimator 


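We now illustrate how to use (6.1.24) to express the optimum estimator $\mathbf{c}_{m+1}$ in terms of the estimator $\mathbf{c}_m$. Indeed, using (6.1.24), (6.1.10), and the normal equations $\mathbf{R}_m\mathbf{c}_m = \mathbf{d}_m$, we have

$$\begin{aligned} \mathbf{c}_{m+1} &= \mathbf{R}_{m+1}^{-1}\mathbf{d}_{m+1} \\ &= \begin{bmatrix} \mathbf{R}_m^{-1} & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix}\begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} + \frac{1}{\alpha_m^b}\begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}_m^H & 1 \end{bmatrix}\begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{R}_m^{-1}\mathbf{d}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix}\frac{\mathbf{b}_m^H\mathbf{d}_m + d_{m+1}}{\alpha_m^b} \end{aligned}$$

or more concisely

$$\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m^c \tag{6.1.32}$$

where the quantities

$$k_m^c \triangleq \frac{\beta_m^c}{\alpha_m^b} \tag{6.1.33}$$

and

$$\beta_m^c \triangleq \mathbf{b}_m^H\mathbf{d}_m + d_{m+1} \tag{6.1.34}$$

contain the "new information" $d_{m+1}$ (the new component of $\mathbf{d}_{m+1}$). By using (6.1.22) and $\mathbf{R}_m\mathbf{c}_m = \mathbf{d}_m$, alternatively $\beta_m^c$ can be written as

$$\beta_m^c = -\mathbf{r}_m^{bH}\mathbf{c}_m + d_{m+1} \tag{6.1.35}$$

We will use the term Levinson recursion for the order-updating relation (6.1.32) because a similar recursion was introduced as part of the celebrated algorithm due to Levinson (see Section 6.3). However, we stress that even though (6.1.32) is order-recursive, the parameter vector $\mathbf{c}_{m+1}$ does not have the optimum nesting property, that is, $\mathbf{c}_{m+1}^{\lceil m \rceil} \neq \mathbf{c}_m$.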

Clearly, if we know the vector $\mathbf{b}_m$, we can determine $\mathbf{c}_{m+1}$ using (6.1.32); however, its practical utility depends on how easily we can obtain the vector $\mathbf{b}_m$. In general, $\mathbf{b}_m$ requires the solution of an $m \times m$ linear system of equations, and the computational savings compared to direct solution of the $(m+1)$st-order normal equations is insignificant. For the Levinson recursion to be useful, we need an order recursion for vector $\mathbf{b}_m$. Since matrix $\mathbf{R}_{m+1}$ has the optimum nesting property, we need to check whether the same is true for the right-hand side vector in $\mathbf{R}_{m+1}\mathbf{b}_{m+1} = -\mathbf{r}_{m+1}^b$. From the definition $\mathbf{r}_m^b \triangleq E\{\mathbf{x}_m x_{m+1}^*\}$, we can easily see that $\mathbf{r}_{m+1}^{b\lceil m \rceil} \neq \mathbf{r}_m^b$ and $\mathbf{r}_{m+1}^{b\lfloor m \rfloor} \neq \mathbf{r}_m^b$. Hence, in general, we cannot find a Levinson recursion for vector $\mathbf{b}_m$. This is possible only in optimum filtering problems in which the input data vector $\mathbf{x}_m(n)$ has a shift-invariance structure (see Section 6.3).

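EXAMPLE 6.1.2 Use the Levinson recursion to determine the optimum linear estimator $\mathbf{c}_3$ specified by the matrix

$$\mathbf{R}_3 = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & 1 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{2} & 1 \end{bmatrix}$$

in Example 6.1.1 and the cross-correlation vector

$$\mathbf{d}_3 = [1\ 2\ 4]^T$$

Solution. For $m = 1$ we have $r_{11}c_1^{(1)} = d_1$, which gives $c_1^{(1)} = 1$. Also, from (6.1.32) and (6.1.34) we obtain $k_0^c = c_1^{(1)} = 1$ and $\beta_0^c = d_1 = 1$. Finally, from $k_0^c = \beta_0^c/\alpha_0^b$, we get $\alpha_0^b = 1$.

To obtain $\mathbf{c}_2$, we need $b_1^{(1)}$, $k_1^c$, $\beta_1^c$, and $\alpha_1^b$. We have

$$b_1^{(1)} = -\frac{r_1^b}{r_{11}} = -\frac{1/2}{1} = -\frac{1}{2}$$

$$\beta_1^c = b_1^{(1)}d_1 + d_2 = -\frac{1}{2}(1) + 2 = \frac{3}{2}$$

$$\alpha_1^b = \rho_1^b + r_1^b b_1^{(1)} = 1 + \frac{1}{2}\left(-\frac{1}{2}\right) = \frac{3}{4}$$

$$k_1^c = \frac{\beta_1^c}{\alpha_1^b} = 2$$

and therefore

$$\mathbf{c}_2 = \begin{bmatrix} c_1^{(1)} \\ 0 \end{bmatrix} + \begin{bmatrix} b_1^{(1)} \\ 1 \end{bmatrix} k_1^c = \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{1}{2} \\ 1 \end{bmatrix} 2 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$$

To determine $\mathbf{c}_3$, we need $\mathbf{b}_2$, $\beta_2^c$, and $\alpha_2^b$. To obtain $\mathbf{b}_2$, we solve the linear system

$$\mathbf{R}_2\mathbf{b}_2 = -\mathbf{r}_2^b \quad\text{or}\quad \begin{bmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} b_1^{(2)} \\ b_2^{(2)} \end{bmatrix} = -\begin{bmatrix} \tfrac{1}{3} \\ \tfrac{1}{2} \end{bmatrix} \;\Rightarrow\; \mathbf{b}_2 = -\frac{1}{9}\begin{bmatrix} 1 \\ 4 \end{bmatrix}$$

and then compute

$$\beta_2^c = \mathbf{b}_2^T\mathbf{d}_2 + d_3 = -\frac{1}{9}\begin{bmatrix} 1 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 4 = 3$$

$$\alpha_2^b = \rho_2^b + \mathbf{r}_2^{bT}\mathbf{b}_2 = 1 - \frac{1}{9}\begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{2} \end{bmatrix}\begin{bmatrix} 1 \\ 4 \end{bmatrix} = \frac{20}{27}$$

$$k_2^c = \frac{\beta_2^c}{\alpha_2^b} = \frac{81}{20}$$

The desired solution $\mathbf{c}_3$ is obtained by using the Levinson recursion

$$\mathbf{c}_3 = \begin{bmatrix} \mathbf{c}_2 \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_2 \\ 1 \end{bmatrix} k_2^c = \begin{bmatrix} 0 \\ 2 \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{1}{9} \\ -\tfrac{4}{9} \\ 1 \end{bmatrix}\frac{81}{20} = \frac{1}{20}\begin{bmatrix} -9 \\ 4 \\ 81 \end{bmatrix}$$

which agrees with the solution obtained by solving $\mathbf{R}_3\mathbf{c}_3 = \mathbf{d}_3$ using the function c3=R3\d3. We can also solve this linear system by developing an algorithm using the lower partitioning (6.1.26) as discussed in Problem 6.4.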
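
The recursion (6.1.32) through (6.1.34) can be sketched in code as follows. This is a minimal illustration in Python for real-valued data with exact rational arithmetic; the helper names `solve` and `levinson_general` are hypothetical. Note that each step still solves an $m \times m$ system for $\mathbf{b}_m$, which is exactly the cost the text says makes the general recursion unattractive.

```python
from fractions import Fraction as F

def solve(A, y):
    """Solve A z = y by Gauss-Jordan elimination (no pivoting; A positive definite)."""
    n = len(A)
    M = [list(row) + [y[i]] for i, row in enumerate(A)]
    for i in range(n):
        M[i] = [v / M[i][i] for v in M[i]]
        for r in range(n):
            if r != i:
                M[r] = [vr - M[r][i] * vi for vr, vi in zip(M[r], M[i])]
    return [row[n] for row in M]

def levinson_general(R, d):
    """Order-recursive solution of R c = d (real symmetric R)."""
    c = [d[0] / R[0][0]]                                   # first-order solution
    for m in range(1, len(d)):
        r_b = [R[i][m] for i in range(m)]
        R_m = [row[:m] for row in R[:m]]
        b = [-v for v in solve(R_m, r_b)]                  # (6.1.22)
        alpha = R[m][m] + sum(p * q for p, q in zip(r_b, b))   # (6.1.23)
        beta = sum(p * q for p, q in zip(b, d[:m])) + d[m]     # (6.1.34)
        k = beta / alpha                                   # (6.1.33)
        c = [ci + bi * k for ci, bi in zip(c, b)] + [k]    # (6.1.32)
    return c

# Data of Example 6.1.2
R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
c3 = levinson_general(R3, [1, 2, 4])
```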

Matrix inversion and the linear system solution for $m = 1$ are trivial (scalar division only). If $\mathbf{R}_M$ is strictly positive definite, that is, $\mathbf{R}_m = \mathbf{R}_M^{\lceil m \rceil}$ is positive definite for all $1 \le m \le M$, the inverse matrices $\mathbf{R}_m^{-1}$ and the solutions of $\mathbf{R}_m\mathbf{c}_m = \mathbf{d}_m$, $2 \le m \le M$, can be determined using (6.1.22) and the Levinson recursion (6.1.32) for $m = 1, 2, \ldots, M-1$. However, in practice using the $LDL^H$ decomposition provides a better method for performing these computations.


6.1.4 Order-Recursive Computation of the $LDL^H$ Decomposition


We start by showing that the $LDL^H$ decomposition can be computed in an order-recursive manner. The procedure is developed as part of a formal proof of the $LDL^H$ decomposition using induction.

For $M = 1$, the matrix $\mathbf{R}_1$ is a positive number $r_{11}$ and can be written uniquely in the form $r_{11} = 1 \cdot \xi_1 \cdot 1 > 0$. As we increment the order $m$, the $(m+1)$st-order principal submatrix of $\mathbf{R}_M$ can be partitioned as in (6.1.6). By the induction hypothesis, there are unique matrices $\mathbf{L}_m$ and $\mathbf{D}_m$ such that

$$\mathbf{R}_m = \mathbf{L}_m\mathbf{D}_m\mathbf{L}_m^H \tag{6.1.36}$$

We next form the matrices

$$\mathbf{L}_{m+1} = \begin{bmatrix} \mathbf{L}_m & \mathbf{0} \\ \mathbf{l}_m^H & 1 \end{bmatrix} \qquad \mathbf{D}_{m+1} = \begin{bmatrix} \mathbf{D}_m & \mathbf{0} \\ \mathbf{0}^H & \xi_{m+1} \end{bmatrix} \tag{6.1.37}$$

and try to determine the vector $\mathbf{l}_m$ and the positive number $\xi_{m+1}$ so that

$$\mathbf{R}_{m+1} = \mathbf{L}_{m+1}\mathbf{D}_{m+1}\mathbf{L}_{m+1}^H \tag{6.1.38}$$

Using (6.1.6) and (6.1.36) through (6.1.38), we see that

$$(\mathbf{L}_m\mathbf{D}_m)\mathbf{l}_m = \mathbf{r}_m^b \tag{6.1.39}$$

$$\rho_m^b = \mathbf{l}_m^H\mathbf{D}_m\mathbf{l}_m + \xi_{m+1} \qquad \xi_{m+1} > 0 \tag{6.1.40}$$

Since

$$\det \mathbf{R}_m = \det \mathbf{L}_m \det \mathbf{D}_m \det \mathbf{L}_m^H = \xi_1\xi_2\cdots\xi_m > 0 \tag{6.1.41}$$

then $\det \mathbf{L}_m\mathbf{D}_m \neq 0$ and (6.1.39) has a unique solution $\mathbf{l}_m$. Finally, from (6.1.41) we obtain $\xi_{m+1} = \det \mathbf{R}_{m+1}/\det \mathbf{R}_m$, and therefore $\xi_{m+1} > 0$ because $\mathbf{R}_{m+1}$ is positive definite. Hence, $\xi_{m+1}$ is uniquely computed from (6.1.41), which completes the proof.

Because the triangular matrix $\mathbf{L}_m$ is generated row by row using (6.1.39) and because the diagonal elements of matrix $\mathbf{D}_m$ are computed sequentially using (6.1.40), both matrices have the optimum nesting property, that is, $\mathbf{L}_m = \mathbf{L}_M^{\lceil m \rceil}$ and $\mathbf{D}_m = \mathbf{D}_M^{\lceil m \rceil}$. The optimum filter $\mathbf{c}_m$ is then computed by solving

$$\mathbf{L}_m\mathbf{D}_m\mathbf{k}_m = \mathbf{d}_m \tag{6.1.42}$$

$$\mathbf{L}_m^H\mathbf{c}_m = \mathbf{k}_m \tag{6.1.43}$$

Using (6.1.42), we can easily see that $\mathbf{k}_m$ has the optimum nesting property, that is, $\mathbf{k}_m = \mathbf{k}_M^{\lceil m \rceil}$ for $1 \le m \le M$. This is a consequence of the lower triangular form of $\mathbf{L}_m$. The computation of $\mathbf{L}_m$, $\mathbf{D}_m$, and $\mathbf{k}_m$ can be done in a simple, order-recursive manner, which is all that is needed to compute $\mathbf{c}_m$ for $1 \le m \le M$. However, the optimum estimator does not have the optimum nesting property, that is, $\mathbf{c}_m \neq \mathbf{c}_M^{\lceil m \rceil}$, because of the backward substitution involved in the solution of the upper triangular system (6.1.43) (see Example 5.3.1).

Using (6.1.42) and (6.1.43), we can write the MMSE for the $m$th-order linear estimator as

$$P_m = P_y - \mathbf{c}_m^H\mathbf{d}_m = P_y - \mathbf{k}_m^H\mathbf{D}_m\mathbf{k}_m \tag{6.1.44}$$

which, owing to the optimum nesting property of $\mathbf{D}_m$ and $\mathbf{k}_m$, leads to

$$P_m = P_{m-1} - \xi_m |k_m|^2 \tag{6.1.45}$$

which is initialized with $P_0 = P_y$. Equation (6.1.45) provides an order-recursive algorithm for the computation of the MMSE.
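
As an illustration of (6.1.39) through (6.1.45), the sketch below builds $\mathbf{L}_m$, $\mathbf{D}_m$, and $\mathbf{k}_m$ one order at a time and accumulates the MMSE. It is written in Python for real-valued data; the function names are hypothetical, and the value $P_y = 25$ used in the usage lines is an assumed number chosen only to make the example concrete (it does not come from the text).

```python
from fractions import Fraction as F

def ldl_order_recursive(R):
    """Return L (unit lower triangular) and the diagonal xi of D, built row by row."""
    M = len(R)
    L = [[F(int(i == j)) for j in range(M)] for i in range(M)]
    xi = [F(R[0][0])]
    for m in range(1, M):
        # forward substitution for (L_m D_m) l_m = r_m^b, as in (6.1.39)
        for j in range(m):
            s = sum(L[j][i] * xi[i] * L[m][i] for i in range(j))
            L[m][j] = (R[j][m] - s) / xi[j]
        # xi_{m+1} = rho_m^b - l_m^T D_m l_m, as in (6.1.40)
        xi.append(R[m][m] - sum(L[m][i] ** 2 * xi[i] for i in range(m)))
    return L, xi

def ldl_k_and_mmse(L, xi, d, P_y):
    """Solve L D k = d by forward substitution (6.1.42) and apply (6.1.45)."""
    k, P = [], [F(P_y)]
    for m in range(len(d)):
        k.append((d[m] - sum(L[m][i] * xi[i] * k[i] for i in range(m))) / xi[m])
        P.append(P[-1] - xi[m] * k[m] ** 2)
    return k, P[1:]

# R_3 and d_3 of Examples 6.1.1 and 6.1.2; P_y = 25 is an assumed value
R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
L, xi = ldl_order_recursive(R3)
k, mmse = ldl_k_and_mmse(L, xi, [1, 2, 4], 25)
```

Because each new row of $\mathbf{L}$ and each new element of $\mathbf{k}$ depends only on earlier rows, increasing the order never changes results already computed.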


6.1.5 Order-Recursive Computation of the Optimum Estimate 


The computation of the optimum linear estimate $\hat{y}_m = \mathbf{c}_m^H\mathbf{x}_m$, using a linear combiner, requires $m$ multiplications and $m-1$ additions. Therefore, if we want to compute $\hat{y}_m$ for $1 \le m \le M$, we need $M$ linear combiners and hence $M(M+1)/2$ operations.

We next provide an alternative, more efficient order-recursive implementation that exploits the triangular decomposition of $\mathbf{R}_{m+1}$. We first notice that using (6.1.43), we obtain

$$\hat{y}_m = \mathbf{c}_m^H\mathbf{x}_m = (\mathbf{k}_m^H\mathbf{L}_m^{-1})\mathbf{x}_m = \mathbf{k}_m^H(\mathbf{L}_m^{-1}\mathbf{x}_m) \tag{6.1.46}$$

Next, we define vector $\mathbf{w}_m$ as

$$\mathbf{L}_m\mathbf{w}_m \triangleq \mathbf{x}_m \tag{6.1.47}$$

which can be found by using forward substitution in order to solve the triangular system. Therefore, we obtain

$$\hat{y}_m = \mathbf{k}_m^H\mathbf{w}_m = \sum_{i=1}^{m} k_i^* w_i \tag{6.1.48}$$

which provides the estimate $\hat{y}_m$ in terms of $\mathbf{k}_m$ and $\mathbf{w}_m$, that is, without using the estimator vector $\mathbf{c}_m$. Hence, if the ultimate goal is the computation of $\hat{y}_m$, we do not need to compute the estimator $\mathbf{c}_m$.

For an order-recursive algorithm to be possible, the vector $\mathbf{w}_m$ must have the optimum nesting property, that is, $\mathbf{w}_m = \mathbf{w}_M^{\lceil m \rceil}$. Indeed, using (6.1.37) and the matrix inversion by partitioning lemma for nonsymmetric matrices (see Problem 6.3), we obtain

$$\mathbf{L}_{m+1}^{-1} = \begin{bmatrix} \mathbf{L}_m & \mathbf{0} \\ \mathbf{l}_m^H & 1 \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{L}_m^{-1} & \mathbf{0} \\ \mathbf{v}_m^H & 1 \end{bmatrix}$$

where

$$\mathbf{v}_m = -\mathbf{L}_m^{-H}\mathbf{l}_m = -\mathbf{L}_m^{-H}\mathbf{D}_m^{-1}\mathbf{L}_m^{-1}\mathbf{r}_m^b = -\mathbf{R}_m^{-1}\mathbf{r}_m^b = \mathbf{b}_m$$

due to (6.1.22). Therefore,

$$\mathbf{w}_{m+1} = \begin{bmatrix} \mathbf{L}_m^{-1} & \mathbf{0} \\ \mathbf{b}_m^H & 1 \end{bmatrix}\begin{bmatrix} \mathbf{x}_m \\ x_{m+1} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_m \\ w_{m+1} \end{bmatrix} \tag{6.1.49}$$

where

$$w_{m+1} = \mathbf{b}_m^H\mathbf{x}_m + x_{m+1} = e_m^b \tag{6.1.50}$$

from (6.1.29). In this case, we can derive order-recursive algorithms for the computation of $\hat{y}_m$ and $e_m$, for all $1 \le m \le M$. Indeed, using (6.1.48) and (6.1.49), we obtain

$$\hat{y}_m = \hat{y}_{m-1} + k_m^* w_m \tag{6.1.51}$$

with $\hat{y}_0 = 0$. From (6.1.51) and $e_m = y - \hat{y}_m$, we have

$$e_m = e_{m-1} - k_m^* w_m \tag{6.1.52}$$

for $m = 1, 2, \ldots, M$ with $e_0 = y$. The quantity $w_m$ can be computed in an order-recursive manner by solving (6.1.47) using forward substitution. Indeed, from the $m$th row of (6.1.47) we obtain

$$w_m = x_m - \sum_{i=1}^{m-1} l_{i-1}^{(m-1)} w_i \tag{6.1.53}$$

which provides a recursive computation of $w_m$ for $m = 1, 2, \ldots, M$. To comply with the order-oriented notation, we use $l_{i-1}^{(m-1)}$ instead of $l_{m-1,i-1}$. Depending on the application, we use either (6.1.51) or (6.1.52).

For MMSE estimation, all the quantities are functions of the time index $n$, and therefore, the triangular decomposition of $\mathbf{R}_m$ and the recursions (6.1.51) through (6.1.53) should be repeated for every new set of observations $y(n)$ and $\mathbf{x}(n)$.
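
The recursions (6.1.51) and (6.1.53) can be sketched as a single loop over the order. The Python fragment below assumes real-valued data; the function name is hypothetical, the factor $\mathbf{L}_3$ and vector $\mathbf{k}_3$ are the ones that follow from the matrix $\mathbf{R}_3$ and vector $\mathbf{d}_3 = [1\ 2\ 4]^T$ of Example 6.1.2, and the data vector `x` is an assumed input chosen only for illustration.

```python
from fractions import Fraction as F

def order_recursive_estimates(L, k, x):
    """Return the innovations w and the estimates y_hat for orders m = 1..M."""
    w, y_hat, acc = [], [], F(0)
    for m in range(len(x)):
        # forward substitution, (6.1.53): remove the part of x_m predicted by w_1..w_{m-1}
        w.append(x[m] - sum(L[m][i] * w[i] for i in range(m)))
        # (6.1.51): add the contribution of the new innovation only
        acc += k[m] * w[m]
        y_hat.append(acc)
    return w, y_hat

L3 = [[F(1), F(0), F(0)],
      [F(1, 2), F(1), F(0)],
      [F(1, 3), F(4, 9), F(1)]]
k3 = [F(1), F(2), F(81, 20)]
x = [1, 2, 3]                      # assumed data vector
w, y_hat = order_recursive_estimates(L3, k3, x)
```

For this input the last estimate equals the direct-form value $\mathbf{c}_3^T\mathbf{x}_3$ with $\mathbf{c}_3 = \frac{1}{20}[-9\ 4\ 81]^T$, while the lower-order estimates are obtained as by-products.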


EXAMPLE 6.1.3 A linear estimator is specified by the correlation matrix $\mathbf{R}_4$ and the cross-correlation vector $\mathbf{d}_4$ in Example 5.3.2. Compute the estimates $\hat{y}_m$, $1 \le m \le 4$, if the input data vector is given by $\mathbf{x}_4 = [1\ 2\ 1\ -1]^T$.

Solution. Using the triangular factor $\mathbf{L}_4$ and the vector $\mathbf{k}_4$ found in Example 5.3.2 and (6.1.53), we find

$$\mathbf{w}_4 = \mathbf{L}_4^{-1}\mathbf{x}_4 = [1\ -1\ 3\ -8]^T$$

and

$$\hat{y}_1 = 1 \qquad \hat{y}_2 = \tfrac{4}{3} \qquad \hat{y}_3 = 6.6 \qquad \hat{y}_4 = 14.6$$

which the reader can verify by computing $\mathbf{c}_m$ and $\hat{y}_m = \mathbf{c}_m^H\mathbf{x}_m$, $1 \le m \le 4$.
If we compute the matrix

$$\mathbf{B}_{m+1} \triangleq \mathbf{L}_{m+1}^{-1} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ b_0^{(1)} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ b_0^{(m)} & b_1^{(m)} & \cdots & 1 \end{bmatrix} \tag{6.1.54}$$

then (6.1.49) can be written as

$$\mathbf{w}_{m+1} = \mathbf{e}_{m+1}^b = \mathbf{B}_{m+1}\mathbf{x}_{m+1} \tag{6.1.55}$$

where

$$\mathbf{e}_{m+1}^b \triangleq [e_0^b\ e_1^b\ \cdots\ e_m^b]^T \tag{6.1.56}$$

is the BLP error vector. From (6.1.22), we can easily see that the rows of $\mathbf{B}_{m+1}$ are formed by the optimum estimators $\mathbf{b}_m$ of $x_{m+1}$ from $\mathbf{x}_m$. Note that the elements of matrix $\mathbf{B}_{m+1}$ are denoted by using the order-oriented notation $b_i^{(m)}$ introduced in Section 6.1 rather than the conventional $b_{mi}$ matrix notation. Equation (6.1.55) provides an alternative computation of $\mathbf{w}_{m+1}$ as a matrix-vector multiplication. Each component of $\mathbf{w}_{m+1}$ can be computed independently, and hence in parallel, by the formula

$$w_j = \sum_{i=1}^{j} b_{i-1}^{(j-1)*} x_i \qquad 1 \le j \le m \tag{6.1.57}$$

which, in contrast to (6.1.53), is nonrecursive. Using (6.1.57) and (6.1.51), we can derive the order-recursive MMSE estimator implementation shown in Figure 6.1.

Finally, we notice that matrix $\mathbf{B}_m$ provides the $UDU^H$ decomposition of the inverse correlation matrix $\mathbf{R}_m^{-1}$. Indeed, from (6.1.36) we obtain

$$\mathbf{R}_m^{-1} = (\mathbf{L}_m^H)^{-1}\mathbf{D}_m^{-1}\mathbf{L}_m^{-1} = \mathbf{B}_m^H\mathbf{D}_m^{-1}\mathbf{B}_m \tag{6.1.58}$$

because inversion and transposition are interchangeable and the $UDU^H$ decomposition is unique. This formula provides a practical method to compute the inverse of the correlation matrix by using the $LDL^H$ decomposition because computing the inverse of a triangular matrix is simple (see Problem 6.5).
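
A short sketch of (6.1.58) follows, assuming real-valued data so that the Hermitian transpose reduces to the ordinary transpose. The function names are hypothetical; the factors used in the usage lines are those of the matrix $\mathbf{R}_3$ in Example 6.1.1.

```python
from fractions import Fraction as F

def unit_lower_inverse(L):
    """Invert a unit lower triangular matrix by forward substitution, column by column."""
    n = len(L)
    B = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(j + 1, n):
            B[i][j] = -sum(L[i][t] * B[t][j] for t in range(j, i))
    return B

def inverse_from_ldl(L, xi):
    """R^{-1} = B^T D^{-1} B with B = L^{-1}, as in (6.1.58), real case."""
    B = unit_lower_inverse(L)
    n = len(L)
    return [[sum(B[t][i] * B[t][j] / xi[t] for t in range(n)) for j in range(n)]
            for i in range(n)]

L3 = [[F(1), F(0), F(0)],
      [F(1, 2), F(1), F(0)],
      [F(1, 3), F(4, 9), F(1)]]
xi3 = [F(1), F(3, 4), F(20, 27)]
B3 = unit_lower_inverse(L3)
R3_inv = inverse_from_ldl(L3, xi3)
```

The rows of `B3` contain the backward predictors, and `R3_inv` reproduces the inverse obtained in Example 6.1.1 by the partitioning lemma.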


[Figure: block diagram with stages labeled Input, Decorrelator, Innovations, Linear combiner, and Output, together with a legend for the basic processing element and the second-order moments.]

FIGURE 6.1
Orthogonal order-recursive structure for linear MMSE estimation.


6.2 Interpretations of Algorithmic Quantities 


We next show that various intermediate quantities that appear in the linear MMSE estimation algorithms have physical and statistical interpretations that, besides their intellectual value, facilitate better understanding of the operation, performance, and numerical properties of the algorithms.


6.2.1 Innovations and Backward Prediction 
The correlation matrix of $\mathbf{w}_m$ is

$$E\{\mathbf{w}_m\mathbf{w}_m^H\} = \mathbf{L}_m^{-1}E\{\mathbf{x}_m\mathbf{x}_m^H\}\mathbf{L}_m^{-H} = \mathbf{D}_m \tag{6.2.1}$$

where we have used (6.1.47) and the triangular decomposition (6.1.36). Therefore, the components of $\mathbf{w}_m$ are uncorrelated random variables with variances

$$\xi_i = E\{|w_i|^2\} \tag{6.2.2}$$

since $\xi_i \ge 0$. Furthermore, the two sets of random variables $\{w_1, w_2, \ldots, w_M\}$ and $\{x_1, x_2, \ldots, x_M\}$ are linearly equivalent because they can be obtained from each other through the linear transformation (6.1.47). This transformation removes all the redundant correlation among the components of $\mathbf{x}$ and is known as a decorrelation or whitening operation (see Section 3.3.2). Because the random variables $w_i$ are uncorrelated, each of them adds "new information" or innovation. In this sense, $\{w_1, w_2, \ldots, w_m\}$ is the innovations representation of the random variables $\{x_1, x_2, \ldots, x_m\}$. Because $\mathbf{x}_m = \mathbf{L}_m\mathbf{w}_m$, the random vector $\mathbf{w}_m = \mathbf{e}_m^b$ is the innovations representation, and $\mathbf{x}_m$ and $\mathbf{w}_m$ are linearly equivalent as well.

The cross-correlation matrix between $\mathbf{x}_m$ and $\mathbf{w}_m$ is

$$E\{\mathbf{x}_m\mathbf{w}_m^H\} = E\{\mathbf{L}_m\mathbf{w}_m\mathbf{w}_m^H\} = \mathbf{L}_m\mathbf{D}_m \tag{6.2.3}$$

which shows that, owing to the lower triangular form of $\mathbf{L}_m$, $E\{x_i w_j^*\} = 0$ for $j > i$.

Furthermore, since $e_m^b = w_{m+1}$, from (6.1.50) we have

$$P_m^b = \xi_{m+1} = E\{|w_{m+1}|^2\}$$

which also can be shown algebraically by using (6.1.41), (6.1.40), and (6.1.30). Indeed, we have

$$\xi_{m+1} = \frac{\det \mathbf{R}_{m+1}}{\det \mathbf{R}_m} = \rho_m^b - \mathbf{l}_m^H\mathbf{D}_m\mathbf{l}_m = \rho_m^b - \mathbf{r}_m^{bH}\mathbf{R}_m^{-1}\mathbf{r}_m^b = P_m^b \tag{6.2.4}$$

and, therefore,

$$\mathbf{D}_m = \mathrm{diag}\{P_0^b, P_1^b, \ldots, P_{m-1}^b\} \tag{6.2.5}$$

6.2.2 Partial Correlation 
In general, the random variables $y, x_1, \ldots, x_m, x_{m+1}$ are correlated. The correlation between $y$ and $x_{m+1}$, after the influence from the components of the vector $\mathbf{x}_m$ has been removed, is known as partial correlation. To remove the correlation due to $\mathbf{x}_m$, we extract from $y$ and $x_{m+1}$ the components that can be predicted from $\mathbf{x}_m$. The remaining correlation is from the estimation errors $e_m$ and $e_m^b$, which are both uncorrelated with $\mathbf{x}_m$ because of the orthogonality principle. Therefore, the partial correlation of $y$ and $x_{m+1}$ is

$$\begin{aligned} \mathrm{PARCOR}(y; x_{m+1}) &\triangleq E\{e_m e_m^{b*}\} = E\{(y - \mathbf{c}_m^H\mathbf{x}_m)e_m^{b*}\} \\ &= E\{y e_m^{b*}\} = E\{y(x_{m+1}^* + \mathbf{x}_m^H\mathbf{b}_m)\} \\ &= E\{y x_{m+1}^*\} + E\{y\mathbf{x}_m^H\}\mathbf{b}_m \\ &= d_{m+1}^* + \mathbf{d}_m^H\mathbf{b}_m \triangleq \beta_m^{c*} \end{aligned} \tag{6.2.6}$$

where we have used the orthogonality principle $E\{\mathbf{x}_m e_m^{b*}\} = \mathbf{0}$ and (6.1.10), (6.1.50), and (6.1.34).

The partial correlation $\mathrm{PARCOR}(y; x_{m+1})$ is also related to the parameters $k_m$ obtained from the $LDL^H$ decomposition. Indeed, from (6.1.42) and (6.1.54), we obtain the relation

$$\mathbf{k}_{m+1} = \mathbf{D}_{m+1}^{-1}\mathbf{B}_{m+1}\mathbf{d}_{m+1} \tag{6.2.7}$$

whose last row is

$$k_{m+1} = \frac{\mathbf{b}_m^H\mathbf{d}_m + d_{m+1}}{\xi_{m+1}} = \frac{\beta_m^c}{P_m^b} = k_m^c \tag{6.2.8}$$

owing to (6.2.4) and (6.2.6).
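EXAMPLE 6.2.1 The $LDL^H$ decomposition of matrix $\mathbf{R}_3$ in Example 6.1.2 is given by

$$\mathbf{L} = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \\ \tfrac{1}{3} & \tfrac{4}{9} & 1 \end{bmatrix} \qquad \mathbf{D} = \mathrm{diag}\left\{1, \tfrac{3}{4}, \tfrac{20}{27}\right\}$$

and can be found by using the function [L,D]=ldlt(R). Comparison with the results obtained in Example 6.1.2 shows that the rows of the matrix

$$\mathbf{L}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac{1}{2} & 1 & 0 \\ -\tfrac{1}{9} & -\tfrac{4}{9} & 1 \end{bmatrix}$$

provide the elements of the backward predictors, whereas the diagonal elements of $\mathbf{D}$ are equal to the scalars $\alpha_m^b$. Using (6.2.7), we obtain $\mathbf{k} = [1\ 2\ \tfrac{81}{20}]^T$, whose elements are the quantities $k_0^c$, $k_1^c$, and $k_2^c$ computed in Example 6.1.2 using the Levinson recursion.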


6.2.3 Order Decomposition of the Optimum Estimate 


The equation $\hat{y}_{m+1} = \hat{y}_m + k_{m+1}^* w_{m+1}$, with $k_{m+1} = \beta_m^c/P_m^b = k_m^c$, shows that the improvement in the estimate when we include one more observation $x_{m+1}$, that is, when we increase the order by 1, is proportional to the innovation $w_{m+1}$ contained in $x_{m+1}$. The innovation is the part of $x_{m+1}$ that cannot be linearly estimated from the already used data $\mathbf{x}_m$. The term $w_{m+1}$ is scaled by the ratio of the partial correlation between $y$ and the "new" observation $x_{m+1}$ and the power of the innovation $P_m^b$.

Thus, the computation of the $(m+1)$st-order estimate of $y$ based on $\mathbf{x}_{m+1} = [\mathbf{x}_m^T\ x_{m+1}]^T$ can be reduced to two $m$th-order estimation problems: the estimation of $y$ based on $\mathbf{x}_m$ and the estimation of the new observation $x_{m+1}$ based on $\mathbf{x}_m$. This decomposition of linear estimation problems into smaller ones has very important applications to the development of efficient algorithms and structures for MMSE estimation.

We use the term direct for the implementation of the MMSE linear combiner as a sum of products, involving the optimum parameters $c_i^{(m)}$, $1 \le i \le m$, to emphasize the direct use of these coefficients. Because the random variables $w_i$ used in the implementation of Figure 6.1 are orthogonal, that is, $\langle w_i, w_j \rangle = 0$ for $i \neq j$, we refer to this implementation as the orthogonal implementation or the orthogonal structure. These two structures appear in every type of linear MMSE estimation problem, and their particular form depends on the specifics of the problem and the associated second-order moments. In this sense, they play a prominent role in linear MMSE estimation in general, and in this book in particular.

We conclude our discussion with the following important observations: 

1. The direct implementation combines correlated, that is, redundant information, and it is not order-recursive because increasing the order of the estimator destroys the optimality of the existing coefficients. Again, the reason is that the direct-form optimum filter coefficients do not possess the optimum nesting property.

2. The orthogonal implementation consists of a decorrelator and a linear combiner. The estimator combines the innovations of the data (nonredundant information) and is order-recursive because it does not use the optimum coefficient vector. Hence, increasing the order of the estimator preserves the optimality of the existing lower-order part. The resulting structure is modular such that each additional term improves the estimate by an amount proportional to the included innovation $w_m$.

3. Using the vector interpretation of random variables, the transformation $\mathbf{z}_m = \mathbf{F}_m\mathbf{x}_m$ is just a change of basis. The choice $\mathbf{F}_m = \mathbf{L}_m^{-1}$ converts from the oblique set $\{x_1, x_2, \ldots, x_m\}$ to the orthogonal basis $\{w_1, w_2, \ldots, w_m\}$. The advantage of working with orthogonal bases is that adding new components does not affect the optimality of previous ones.

4. The $LDL^H$ decomposition for random vectors is the matrix equivalent of the spectral factorization theorem for discrete-time, stationary, stochastic processes. Both approaches facilitate the design and implementation of optimum FIR and IIR filters (see Sections 5.3 and 5.6).


6.2.4 Gram-Schmidt Orthogonalization 


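We next combine the geometric interpretation of the random variables with the Gram-Schmidt procedure used in linear algebra. The Gram-Schmidt procedure produces the innovations $\{w_1, w_2, \ldots, w_M\}$ by orthogonalizing the original set $\{x_1, x_2, \ldots, x_M\}$.

We start by choosing $w_1$ to be in the direction of $x_1$, that is,

$$w_1 \triangleq x_1$$

The next "vector" $w_2$ should be orthogonal to $w_1$. To determine $w_2$, we subtract from $x_2$ its component along $w_1$ [see Figure 6.2(a)], that is,

$$w_2 \triangleq x_2 - l_0^{(1)} w_1$$

where $l_0^{(1)}$ is obtained from the condition $w_2 \perp w_1$ as follows:

$$\langle w_2, w_1 \rangle = \langle x_2, w_1 \rangle - l_0^{(1)}\langle w_1, w_1 \rangle = 0$$

or

$$l_0^{(1)} = \frac{\langle x_2, w_1 \rangle}{\langle w_1, w_1 \rangle}$$

Similarly, to determine $w_3$, we subtract from $x_3$ its components along $w_1$ and $w_2$, that is,

$$w_3 = x_3 - l_0^{(2)} w_1 - l_1^{(2)} w_2$$

as illustrated in Figure 6.2(b). Using the conditions $w_3 \perp w_1$ and $w_3 \perp w_2$, we can easily see that

$$l_0^{(2)} = \frac{\langle x_3, w_1 \rangle}{\langle w_1, w_1 \rangle} \qquad l_1^{(2)} = \frac{\langle x_3, w_2 \rangle}{\langle w_2, w_2 \rangle}$$

This approach leads to the following classical Gram-Schmidt algorithm:

• Define $w_1 = x_1$.
• For $2 \le m \le M$, compute

$$w_m = x_m - l_0^{(m-1)} w_1 - \cdots - l_{m-2}^{(m-1)} w_{m-1} \tag{6.2.9}$$

where

$$l_{i-1}^{(m-1)} \triangleq \frac{\langle x_m, w_i \rangle}{\langle w_i, w_i \rangle} \qquad 1 \le i \le m-1 \tag{6.2.10}$$

assuming that $\langle w_i, w_i \rangle \neq 0$.

From the derivation of the algorithm it should be clear that the sets $\{x_1, \ldots, x_m\}$ and $\{w_1, \ldots, w_m\}$ are linearly equivalent for $m = 1, 2, \ldots, M$. Using (6.2.9), we obtain

$$\mathbf{x}_m = \mathbf{L}_m\mathbf{w}_m \tag{6.2.11}$$

where

$$\mathbf{L}_m \triangleq \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_0^{(1)} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_0^{(m-1)} & l_1^{(m-1)} & \cdots & 1 \end{bmatrix} \tag{6.2.12}$$

is a unit lower triangular matrix.

[Figure: geometric construction of the orthogonal vectors; panel (a) corresponds to m = 2.]

FIGURE 6.2
Illustration of the Gram-Schmidt orthogonalization process.

Since, by construction, the components of $\mathbf{w}_m$ are uncorrelated, its correlation matrix $\mathbf{D}_m$ is diagonal with elements $\xi_i = E\{|w_i|^2\}$. Using (6.2.11), we obtain

$$\mathbf{R}_m = E\{\mathbf{x}_m\mathbf{x}_m^H\} = \mathbf{L}_m E\{\mathbf{w}_m\mathbf{w}_m^H\}\mathbf{L}_m^H = \mathbf{L}_m\mathbf{D}_m\mathbf{L}_m^H \tag{6.2.13}$$

which is precisely the unique $LDL^H$ decomposition of the correlation matrix $\mathbf{R}_m$. Therefore, the Gram-Schmidt orthogonalization of the data vector $\mathbf{x}_m$ provides an alternative approach to obtain the $LDL^H$ decomposition of its correlation matrix $\mathbf{R}_m = E\{\mathbf{x}_m\mathbf{x}_m^H\}$.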
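
The equivalence between Gram-Schmidt orthogonalization and the $LDL^H$ decomposition can be illustrated numerically. In the Python sketch below (real-valued case, hypothetical function name), each random variable is represented by its coefficient vector with respect to $x_1, \ldots, x_M$, so that the inner product $\langle u, v \rangle = E\{uv\}$ becomes the quadratic form $\mathbf{u}^T\mathbf{R}\mathbf{v}$. This representation is an assumption of the sketch, made so that the procedure can be run from the correlation matrix alone.

```python
from fractions import Fraction as F

def gram_schmidt_ldl(R):
    """Classical Gram-Schmidt on x_1..x_M with <x_i, x_j> = R[i][j] (real case).

    Returns the unit lower triangular L and the innovation variances xi."""
    M = len(R)

    def inner(u, v):
        return sum(u[i] * R[i][j] * v[j] for i in range(M) for j in range(M))

    def unit(m):
        return [F(int(i == m)) for i in range(M)]

    W = []                                            # coefficient vectors of w_1..w_M
    L = [[F(int(i == j)) for j in range(M)] for i in range(M)]
    for m in range(M):
        w = unit(m)                                   # start from x_m
        for i in range(m):
            L[m][i] = inner(unit(m), W[i]) / inner(W[i], W[i])  # (6.2.10)
            w = [p - L[m][i] * q for p, q in zip(w, W[i])]      # (6.2.9)
        W.append(w)
    xi = [inner(w, w) for w in W]
    return L, xi

R3 = [[F(1), F(1, 2), F(1, 3)],
      [F(1, 2), F(1), F(1, 2)],
      [F(1, 3), F(1, 2), F(1)]]
L_gs, xi_gs = gram_schmidt_ldl(R3)
```

For the matrix $\mathbf{R}_3$ of Example 6.1.1, the result coincides with the factors $\mathbf{L}$ and $\mathbf{D}$ of the $LDL^H$ decomposition, as (6.2.13) predicts.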


6.3 Order-Recursive Algorithms for Optimum FIR Filters 


The key difference between a linear combiner and an FIR filter is the nature of the input data vector. The input data vector for FIR filters consists of consecutive samples from the same discrete-time stochastic process, that is,

$$\mathbf{x}_m(n) = [x(n)\ x(n-1)\ \cdots\ x(n-m+1)]^T \tag{6.3.1}$$

instead of samples from $m$ different processes $x_i(n)$. This shift invariance of the input data vector allows for the development of simpler, order-recursive algorithms and structures for optimum FIR filtering and prediction compared to those for general linear estimation. Furthermore, the quest for order-recursive algorithms leads to a natural, elegant, and unavoidable interconnection between optimum filtering and the BLP and FLP problems.

We start with the following upper and lower partitioning of the input data vector

$$\mathbf{x}_{m+1}(n) = \begin{bmatrix} x(n) \\ x(n-1) \\ \vdots \\ x(n-m+1) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix} \tag{6.3.2}$$

which shows that $\mathbf{x}_{m+1}^{\lceil m \rceil}(n)$ and $\mathbf{x}_{m+1}^{\lfloor m \rfloor}(n)$ are simply shifted versions (by one sample delay) of the same vector $\mathbf{x}_m(n)$. The shift invariance of $\mathbf{x}_{m+1}(n)$ results in an analogous shift invariance for the correlation matrix $\mathbf{R}_{m+1}(n) = E\{\mathbf{x}_{m+1}(n)\mathbf{x}_{m+1}^H(n)\}$. Indeed, we can easily show that the upper-lower partitioning of the correlation matrix is

$$\mathbf{R}_{m+1}(n) = \begin{bmatrix} \mathbf{R}_m(n) & \mathbf{r}_m^b(n) \\ \mathbf{r}_m^{bH}(n) & P_x(n-m) \end{bmatrix} \tag{6.3.3}$$

and the lower-upper partitioning is

$$\mathbf{R}_{m+1}(n) = \begin{bmatrix} P_x(n) & \mathbf{r}_m^{fH}(n) \\ \mathbf{r}_m^f(n) & \mathbf{R}_m(n-1) \end{bmatrix} \tag{6.3.4}$$

where

$$\mathbf{r}_m^b(n) = E\{\mathbf{x}_m(n)x^*(n-m)\} \tag{6.3.5}$$

$$\mathbf{r}_m^f(n) = E\{\mathbf{x}_m(n-1)x^*(n)\} \tag{6.3.6}$$

$$P_x(n) = E\{|x(n)|^2\} \tag{6.3.7}$$

We note that, in contrast to the general case (6.1.5) where the matrix $\mathbf{R}_m^f(n) = \mathbf{R}_{m+1}^{\lfloor m \rfloor}(n)$ is unrelated to $\mathbf{R}_m(n)$, here the matrix $\mathbf{R}_m^f(n) = \mathbf{R}_m(n-1)$. This is a by-product of the shift-invariance property of the input data vector and takes the development of order-recursive algorithms one step further. We begin our pursuit of an order-recursive algorithm with the development of a Levinson order recursion for the optimum FIR filter coefficients.


6.3.1 Order-Recursive Computation of the Optimum Filter 
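Suppose that at time $n$ we have already computed the optimum FIR filter $\mathbf{c}_m(n)$ specified by

$$\mathbf{c}_m(n) = \mathbf{R}_m^{-1}(n)\mathbf{d}_m(n) \tag{6.3.8}$$

and the MMSE is

$$P_m^c(n) = P_y(n) - \mathbf{d}_m^H(n)\mathbf{c}_m(n) \tag{6.3.9}$$

where

$$\mathbf{d}_m(n) = E\{\mathbf{x}_m(n)y^*(n)\} \tag{6.3.10}$$

We wish to compute the optimum filter

$$\mathbf{c}_{m+1}(n) = \mathbf{R}_{m+1}^{-1}(n)\mathbf{d}_{m+1}(n)$$

by modifying $\mathbf{c}_m(n)$ using an order-recursive algorithm. From (6.3.3), we see that matrix $\mathbf{R}_{m+1}(n)$ has the optimum nesting property. Using the upper partitioning in (6.3.2), we obtain

$$\mathbf{d}_{m+1}(n) = E\left\{\begin{bmatrix} \mathbf{x}_m(n) \\ x(n-m) \end{bmatrix} y^*(n)\right\} = \begin{bmatrix} \mathbf{d}_m(n) \\ d_{m+1}(n) \end{bmatrix} \tag{6.3.11}$$

which shows that $\mathbf{d}_{m+1}(n)$ also has the optimum nesting property. Therefore, we can develop a Levinson order recursion using the upper left matrix inversion by partitioning lemma

$$\mathbf{R}_{m+1}^{-1}(n) = \begin{bmatrix} \mathbf{R}_m^{-1}(n) & \mathbf{0}_m \\ \mathbf{0}_m^H & 0 \end{bmatrix} + \frac{1}{P_m^b(n)}\begin{bmatrix} \mathbf{b}_m(n) \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}_m^H(n) & 1 \end{bmatrix} \tag{6.3.12}$$

where

$$\mathbf{b}_m(n) = -\mathbf{R}_m^{-1}(n)\mathbf{r}_m^b(n) \tag{6.3.13}$$

is the optimum BLP, and

$$P_m^b(n) = \frac{\det \mathbf{R}_{m+1}(n)}{\det \mathbf{R}_m(n)} = P_x(n-m) + \mathbf{r}_m^{bH}(n)\mathbf{b}_m(n) \tag{6.3.14}$$

is the corresponding MMSE. Equations (6.3.12) through (6.3.14) follow easily from (6.1.22), (6.1.23), and (6.1.24). It is interesting to note that $\mathbf{b}_m(n)$ is the optimum estimator for the additional observation $x(n-m)$ used by the optimum filter $\mathbf{c}_{m+1}(n)$. Substituting (6.3.11) and (6.3.12) into $\mathbf{c}_{m+1}(n) = \mathbf{R}_{m+1}^{-1}(n)\mathbf{d}_{m+1}(n)$, we obtain

$$\mathbf{c}_{m+1}(n) = \begin{bmatrix} \mathbf{c}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n) \\ 1 \end{bmatrix} k_m^c(n) \tag{6.3.15}$$

where

$$k_m^c(n) \triangleq \frac{\beta_m^c(n)}{P_m^b(n)} \tag{6.3.16}$$

and

$$\beta_m^c(n) \triangleq \mathbf{b}_m^H(n)\mathbf{d}_m(n) + d_{m+1}(n) \tag{6.3.17}$$

Thus, if we know the BLP $\mathbf{b}_m(n)$, we can determine $\mathbf{c}_{m+1}(n)$ by using the Levinson recursion in (6.3.15).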
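Levinson recursion for the backward predictor. For the order recursion in (6.3.15) to be useful, we need an order recursion for the BLP $\mathbf{b}_m(n)$. This is possible if the linear systems

$$\begin{aligned} \mathbf{R}_m(n)\mathbf{b}_m(n) &= -\mathbf{r}_m^b(n) \\ \mathbf{R}_{m+1}(n)\mathbf{b}_{m+1}(n) &= -\mathbf{r}_{m+1}^b(n) \end{aligned} \tag{6.3.18}$$

are nested. Since the matrices are nested [see (6.3.3)], we check whether the right-hand side vectors are nested. We can easily see that no optimum nesting is possible if we use the upper partitioning in (6.3.2). However, if we use the lower-upper partitioning, we obtain

$$\mathbf{r}_{m+1}^b(n) = E\left\{\begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix} x^*(n-m-1)\right\} = \begin{bmatrix} r_{m+1}^b(n) \\ \mathbf{r}_m^b(n-1) \end{bmatrix} \tag{6.3.19}$$

which provides a partitioning that includes the wanted vector $\mathbf{r}_m^b(n)$ delayed by one sample as a result of the shift invariance of $\mathbf{x}_m(n)$. To explore this partitioning, we use the lower-upper corner matrix inversion by partitioning lemma

$$\mathbf{R}_{m+1}^{-1}(n) = \begin{bmatrix} 0 & \mathbf{0}_m^H \\ \mathbf{0}_m & \mathbf{R}_m^{-1}(n-1) \end{bmatrix} + \frac{1}{P_m^f(n)}\begin{bmatrix} 1 \\ \mathbf{a}_m(n) \end{bmatrix}\begin{bmatrix} 1 & \mathbf{a}_m^H(n) \end{bmatrix} \tag{6.3.20}$$

where

$$\mathbf{a}_m(n) \triangleq -\mathbf{R}_m^{-1}(n-1)\mathbf{r}_m^f(n) \tag{6.3.21}$$

is the optimum FLP and

$$P_m^f(n) = \frac{\det \mathbf{R}_{m+1}(n)}{\det \mathbf{R}_m(n-1)} = P_x(n) + \mathbf{r}_m^{fH}(n)\mathbf{a}_m(n) \tag{6.3.22}$$

is the forward linear prediction MMSE. Equations (6.3.20) through (6.3.22) follow easily from (6.1.26) through (6.1.28). Substituting (6.3.20) and (6.3.19) into

$$\mathbf{b}_{m+1}(n) = -\mathbf{R}_{m+1}^{-1}(n)\mathbf{r}_{m+1}^b(n)$$

we obtain the recursion

$$\mathbf{b}_{m+1}(n) = \begin{bmatrix} 0 \\ \mathbf{b}_m(n-1) \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m(n) \end{bmatrix} k_m^b(n) \tag{6.3.23}$$

where

$$k_m^b(n) \triangleq -\frac{\beta_m^b(n)}{P_m^f(n)} \tag{6.3.24}$$

and

$$\beta_m^b(n) \triangleq r_{m+1}^b(n) + \mathbf{a}_m^H(n)\mathbf{r}_m^b(n-1) \tag{6.3.25}$$

To proceed with the development of the order-recursive algorithm, we clearly need an order recursion for the optimum FLP $\mathbf{a}_m(n)$.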

Levinson recursion for the forward predictor. Following a similar procedure for the Levinson recursion of the BLP, we can derive the Levinson recursion for the FLP. If we use the upper-lower partitioning in (6.3.2), we obtain

$$\mathbf{r}_{m+1}^f(n) = E\{\mathbf{x}_{m+1}(n-1)x^*(n)\} = \begin{bmatrix} \mathbf{r}_m^f(n) \\ r_{m+1}^f(n) \end{bmatrix} \tag{6.3.26}$$

which in conjunction with (6.3.12) and (6.3.21) leads to the following order recursion

$$\mathbf{a}_{m+1}(n) = \begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} k_m^f(n) \tag{6.3.27}$$

where

$$k_m^f(n) \triangleq -\frac{\beta_m^f(n)}{P_m^b(n-1)} \tag{6.3.28}$$

and

$$\beta_m^f(n) \triangleq \mathbf{b}_m^H(n-1)\mathbf{r}_m^f(n) + r_{m+1}^f(n) \tag{6.3.29}$$

Is an order-recursive algorithm feasible? For $m = 1$, we have a scalar equation $r_{11}(n)c_1^{(1)}(n) = d_1(n)$ whose solution is $c_1^{(1)}(n) = d_1(n)/r_{11}(n)$. Using the Levinson order recursions for $m = 1, 2, \ldots, M-1$, we can find $\mathbf{c}_M(n)$ if the quantities $\mathbf{b}_m(n-1)$ and $P_m^b(n-1)$, $1 \le m < M$, required by (6.3.27) and (6.3.28) are known. The lack of this information prevents the development of a complete order-recursive algorithm for the solution of the normal equations for optimum FIR filtering or prediction. The need for time updates arises because each order update requires both the upper left corner and the lower right corner partitionings

$$\mathbf{R}_{m+1}(n) = \begin{bmatrix} \mathbf{R}_m(n) & \times \\ \times & \times \end{bmatrix} = \begin{bmatrix} \times & \times \\ \times & \mathbf{R}_m(n-1) \end{bmatrix}$$

of matrix $\mathbf{R}_{m+1}$. The presence of $\mathbf{R}_m(n-1)$, which is a result of the nonstationarity of the input signal, creates the need for a time updating of $\mathbf{b}_m(n)$. This is possible only for certain types of nonstationarity that can be described by simple relations between $\mathbf{R}_m(n)$ and $\mathbf{R}_m(n-1)$. The simplest case occurs for stationary processes where $\mathbf{R}_m(n) = \mathbf{R}_m(n-1) = \mathbf{R}_m$. Another very useful case occurs for nonstationary processes generated by linear state-space models, which results in the Kalman filtering algorithm (see Section 6.8).

Partial correlation interpretation. The partial correlation between $y(n)$ and $x(n-m)$, after the influence of the intermediate samples $x(n), x(n-1), \ldots, x(n-m+1)$ has been removed, is

$$E\{e_m^b(n)e_m^*(n)\} = \mathbf{b}_m^H(n)\mathbf{d}_m(n) + d_{m+1}(n) = \beta_m^c(n) \tag{6.3.30}$$

which is obtained by working as in the derivation of (6.2.6). It can be shown that the $k_m(n)$ parameters in the Levinson recursions can be obtained from

$$\begin{aligned} \mathbf{R}_m(n) &= \mathbf{L}_m(n)\mathbf{D}_m(n)\mathbf{L}_m^H(n) \\ \mathbf{L}_m(n)\mathbf{D}_m(n)\mathbf{k}_m^c(n) &= \mathbf{d}_m(n) \\ \mathbf{L}_m(n)\mathbf{D}_m(n)\mathbf{k}_m^f(n) &= \mathbf{r}_m^b(n) \\ \mathbf{L}_m(n-1)\mathbf{D}_m(n-1)\mathbf{k}_m^b(n) &= \mathbf{r}_m^f(n) \end{aligned} \tag{6.3.31}$$

that is, as a by-product of the $LDL^H$ decomposition.

Similarly, if we consider the sequence $x(n), x(n-1), \ldots, x(n-m), x(n-m-1)$, we can show that the partial correlation between $x(n)$ and $x(n-m-1)$ is given by (see Problem 6.6)

$$E\{e_m^b(n-1)e_m^{f*}(n)\} = r_{m+1}^f(n) + \mathbf{b}_m^H(n-1)\mathbf{r}_m^f(n) = \beta_m^f(n) \tag{6.3.32}$$

Because $r_{m+1}^f(n) = r_{m+1}^{b*}(n)$, we have the following simplification

$$\begin{aligned} \beta_m^f(n) &= \mathbf{b}_m^H(n-1)\mathbf{R}_m(n-1)\mathbf{R}_m^{-1}(n-1)\mathbf{r}_m^f(n) + r_{m+1}^f(n) \\ &= \mathbf{r}_m^{bH}(n-1)\mathbf{a}_m(n) + r_{m+1}^{b*}(n) = \beta_m^{b*}(n) \end{aligned}$$

which is known as Burg's lemma (Burg 1975). In order to simplify the notation, we define

$$\beta_m(n) \triangleq \beta_m^f(n) = \beta_m^{b*}(n) \tag{6.3.33}$$

Using (6.3.24), (6.3.28), and (6.3.30), we obtain

$$k_m^b(n)k_m^f(n) = \frac{|\beta_m(n)|^2}{P_m^f(n)P_m^b(n-1)} = \frac{|E\{e_m^b(n-1)e_m^{f*}(n)\}|^2}{E\{|e_m^f(n)|^2\}E\{|e_m^b(n-1)|^2\}} \tag{6.3.34}$$

which implies that

$$0 \le k_m^f(n)k_m^b(n) \le 1 \tag{6.3.35}$$

because the last term in (6.3.34) is the squared magnitude of the correlation coefficient of the random variables $e_m^f(n)$ and $e_m^b(n-1)$.
Order recursions for the MMSEs. Using the Levinson order recursions, we can obtain order-recursive formulas for the computation of $P_m^f(n)$, $P_m^b(n)$, and $P_m^c(n)$. Indeed, using (6.3.26), (6.3.27), and (6.3.29), we have

$$\begin{aligned} P_{m+1}^f(n) &= P_x(n) + \mathbf{r}_{m+1}^{fH}(n)\mathbf{a}_{m+1}(n) \\ &= P_x(n) + \begin{bmatrix} \mathbf{r}_m^{fH}(n) & r_{m+1}^{f*}(n) \end{bmatrix}\left(\begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} k_m^f(n)\right) \\ &= P_x(n) + \mathbf{r}_m^{fH}(n)\mathbf{a}_m(n) + [\mathbf{r}_m^{fH}(n)\mathbf{b}_m(n-1) + r_{m+1}^{f*}(n)]k_m^f(n) \end{aligned}$$

or

$$P_{m+1}^f(n) = P_m^f(n) + \beta_m^*(n)k_m^f(n) = P_m^f(n) - \frac{|\beta_m(n)|^2}{P_m^b(n-1)} \tag{6.3.36}$$

If we work in a similar manner, we obtain

$$P_{m+1}^b(n) = P_m^b(n-1) + \beta_m(n)k_m^b(n) = P_m^b(n-1) - \frac{|\beta_m(n)|^2}{P_m^f(n)} \tag{6.3.37}$$

and

$$P_{m+1}^c(n) = P_m^c(n) - \beta_m^{c*}(n)k_m^c(n) = P_m^c(n) - \frac{|\beta_m^c(n)|^2}{P_m^b(n)} \tag{6.3.38}$$

If the subtrahends in the previous recursions are nonzero, increasing the order of the filter always improves the estimates, that is, $P_{m+1}^c(n) < P_m^c(n)$. Also, the conditions $P_m^f(n) \neq 0$ and $P_m^b(n) \neq 0$ are critical for the invertibility of $\mathbf{R}_m(n)$ and the computation of the optimum filters. The above relations are special cases of (6.1.45).

The presence of vectors with mixed optimum nesting (upper-lower and lower-upper) in the definitions of $\beta_m(n)$ and $\beta_m^c(n)$ does not lead to similar order recursions for these quantities. However, for stationary processes we can break the dot products in (6.3.17) and (6.3.25) into scalar recursions, using an algorithm first introduced by Schur.


6.3.2 Lattice-Ladder Structure 


We saw that the shift invariance of the input data vector made it possible to develop the Levinson recursions for the BLP and the FLP. We next show that these recursions can be used to simplify the triangular order-recursive estimation structure of Figure 6.1 by reducing it to a more efficient (linear instead of triangular) lattice-ladder filter structure that simultaneously provides the FLP, BLP, and FIR filtering estimates.

The computation of the estimation errors using direct-form structures is based on the following equations:

$$\begin{aligned} e_m^f(n) &= x(n) + \mathbf{a}_m^H(n)\mathbf{x}_m(n-1) \\ e_m^b(n) &= x(n-m) + \mathbf{b}_m^H(n)\mathbf{x}_m(n) \\ e_m(n) &= y(n) - \mathbf{c}_m^H(n)\mathbf{x}_m(n) \end{aligned} \tag{6.3.39}$$

Using (6.3.2), (6.3.27), and (6.3.39), we obtain

$$\begin{aligned} e_{m+1}^f(n) &= x(n) + \left(\begin{bmatrix} \mathbf{a}_m(n) \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m(n-1) \\ 1 \end{bmatrix} k_m^f(n)\right)^H\begin{bmatrix} \mathbf{x}_m(n-1) \\ x(n-1-m) \end{bmatrix} \\ &= x(n) + \mathbf{a}_m^H(n)\mathbf{x}_m(n-1) + [\mathbf{b}_m^H(n-1)\mathbf{x}_m(n-1) + x(n-1-m)]k_m^{f*}(n) \end{aligned}$$

or

$$e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n)e_m^b(n-1) \tag{6.3.40}$$

In a similar manner, we obtain

$$e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n)e_m^f(n) \tag{6.3.41}$$

using (6.3.2), (6.3.23), and (6.3.39). Relations (6.3.40) and (6.3.41) are executed for $m = 0, 1, \ldots, M-2$, with $e_0^f(n) = e_0^b(n) = x(n)$, and constitute a lattice filter that implements the FLP and the BLP.

Using (6.3.2), (6.3.15), and (6.3.39), we can show that the optimum filtering error can be computed by

$$e_{m+1}(n) = e_m(n) - k_m^{c*}(n)e_m^b(n) \tag{6.3.42}$$

which is executed for $m = 0, 1, \ldots, M-1$, with $e_0(n) = y(n)$. The last equation provides the ladder part, which is coupled with the lattice predictor to implement the optimum filter. The result is the time-varying lattice-ladder structure shown in Figure 6.3. Notice that a new set of lattice-ladder coefficients has to be computed for every $n$, using $\mathbf{R}_m(n)$ and $\mathbf{d}_m(n)$. The parameters of the lattice-ladder structure can be obtained by $LDL^H$ decomposition using (6.3.31). Suppose now that we know $P_0^f(n) = P_0^b(n) = P_x(n)$, $P_0^b(n-1)$, $P_0^c(n) = P_y(n)$, $\{\beta_m(n)\}_0^{M-1}$, and $\{\beta_m^c(n)\}_0^{M}$. Then we can determine $P_m^f(n)$, $P_m^b(n)$, and $P_m^c(n)$ for all $m$, using (6.3.36) through (6.3.38), and all filter coefficients, using (6.3.16), (6.3.24), and (6.3.28). However, to obtain a completely time-recursive updating algorithm, we need time updatings for $\beta_m(n)$ and $\beta_m^c(n)$. As we will see later, this is possible if $\mathbf{R}(n)$ and $\mathbf{d}(n)$ are fixed or are defined by known time-updating formulas.
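
To make the lattice-ladder recursions concrete, the following Python sketch runs (6.3.40) through (6.3.42) sample by sample. It assumes the real-valued, time-invariant case, in which the forward and backward lattice coefficients coincide, and zero initial conditions; the function name and the short test signals are hypothetical. The coefficient values in the usage lines are those that follow from the correlation values $r(0) = 1$, $r(1) = 1/2$, $r(2) = 1/3$ and the vector $\mathbf{d}_3 = [1\ 2\ 4]^T$ of Example 6.1.2.

```python
from fractions import Fraction as F

def lattice_ladder(x, y, k, kc):
    """Real, time-invariant lattice-ladder filter with zero initial conditions.

    k  = [k_0, ..., k_{M-2}]    lattice coefficients
    kc = [k^c_0, ..., k^c_{M-1}] ladder coefficients
    Returns the sequences e_M(n) and e^f_{M-1}(n)."""
    M = len(kc)
    eb_prev = [F(0)] * M                      # e^b_m(n-1) for m = 0..M-1
    e_out, ef_out = [], []
    for xn, yn in zip(x, y):
        ef, eb = F(xn), [F(xn)]               # e^f_0(n) = e^b_0(n) = x(n)
        for m in range(M - 1):
            eb.append(eb_prev[m] + k[m] * ef) # (6.3.41), uses e^f_m(n)
            ef = ef + k[m] * eb_prev[m]       # (6.3.40)
        e = F(yn)                             # e_0(n) = y(n)
        for m in range(M):
            e -= kc[m] * eb[m]                # (6.3.42)
        eb_prev = eb
        e_out.append(e)
        ef_out.append(ef)
    return e_out, ef_out

x = [1, 2, -1, 3]                             # assumed input samples
y = [0, 1, 1, 2]                              # assumed desired-response samples
e, ef = lattice_ladder(x, y, [F(-1, 2), F(-1, 9)], [F(1), F(2), F(81, 20)])
```

With these coefficients the outputs agree, sample by sample, with the direct-form errors in (6.3.39) computed from $\mathbf{c}_3 = \frac{1}{20}[-9\ 4\ 81]^T$ and $\mathbf{a}_2 = -\frac{1}{9}[4\ 1]^T$, although the lattice-ladder structure never forms these vectors.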
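[Figure: cascade of lattice stages, labeled Stage 1 through Stage M-1, driven by $x(n)$, with a ladder section driven by $y(n)$ that produces the filtering errors.]

FIGURE 6.3
Lattice-ladder structure for FIR optimum filtering and prediction.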


We recall that the BLP error vector $\mathbf{e}_{m+1}^b(n)$ is the innovations vector of the data $\mathbf{x}_{m+1}(n)$. Notice that as a result of the shift invariance of the input data vector, the triangular decorrelator of the general linear estimator (see Figure 6.1) is replaced by a simpler, "linear" lattice structure. For stationary processes, the lattice-ladder filter is time-invariant, and we need to compute only one set of coefficients that can be used for all signals with the same $\mathbf{R}$ and $\mathbf{d}$ (see Section 6.5).


6.3.3 Simplifications for Stationary Stochastic Processes 


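When $x(n)$ and $y(n)$ are jointly wide-sense stationary (WSS), the optimum estimators are time-invariant and we have
the following simplifications:
• All quantities are independent of $n$; thus we do not need time recursions for the BLP parameters.
• $\mathbf{b}_m = \mathbf{J}\mathbf{a}^*_m$ (see Section 5.5.4), and thus we do not need the Levinson recursion for the BLP $\mathbf{b}_m$.
Both simplifications are a consequence of the Toeplitz structure of the correlation matrix $\mathbf{R}_m$. Indeed,
comparing the partitionings

$$\mathbf{R}_{m+1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{J}\mathbf{r}_m \\ \mathbf{r}^H_m\mathbf{J} & r(0) \end{bmatrix} = \begin{bmatrix} r(0) & \mathbf{r}^T_m \\ \mathbf{r}^*_m & \mathbf{R}_m \end{bmatrix} \tag{6.3.43}$$

where
$$\mathbf{r}_m = [r(1)\ r(2)\ \cdots\ r(m)]^T \tag{6.3.44}$$

with (6.3.3) and (6.3.4), we have

$$\mathbf{R}_m(n) = \mathbf{R}_m(n-1) = \mathbf{R}_m, \qquad \mathbf{r}^f_m(n) = \mathbf{r}^*_m, \qquad \mathbf{r}^b_m(n) = \mathbf{J}\mathbf{r}_m \tag{6.3.45}$$

which can be used to simplify the order recursions derived for nonstationary processes. Indeed, we can easily show
that

$$\mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m \tag{6.3.46}$$

where
$$\mathbf{b}_m = \mathbf{J}\mathbf{a}^*_m \tag{6.3.47}$$
$$k_m \triangleq k^f_m = k^{b*}_m = -\frac{\beta_m}{P_m} \tag{6.3.48}$$
$$\beta_m \triangleq \beta^f_m = \beta^{b*}_m = \mathbf{b}^H_m\mathbf{r}^*_m + r^*(m+1) \tag{6.3.49}$$
$$P_m \triangleq P^b_m = P^f_m = P_{m-1} + \beta_{m-1}k^*_{m-1} = P_{m-1} + \beta^*_{m-1}k_{m-1} \tag{6.3.50}$$

This recursion provides a complete order-recursive algorithm for the computation of the FLP $\mathbf{a}_m$ for $1 \le m \le M$
from the autocorrelation sequence $r(l)$ for $0 \le l \le M$.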

The optimum filters $\mathbf{c}_m$ for $1 \le m \le M$ can be obtained from the quantities $\mathbf{a}_m$ and $P_m$ for
$1 \le m \le M-1$ and $\mathbf{d}_M$, using the following Levinson recursion

$$\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k^c_m \tag{6.3.51}$$

where
$$k^c_m \triangleq \frac{\beta^c_m}{P_m} \tag{6.3.52}$$
and
$$\beta^c_m = \mathbf{b}^H_m\mathbf{d}_m + d_{m+1} \tag{6.3.53}$$

The MMSE $P^c_m$ is then given by
$$P^c_{m+1} = P^c_m - \beta^c_m k^{c*}_m \tag{6.3.54}$$

and although it is not required by the algorithm, $P^c_m$ is useful for selecting the order of the optimum filter. Both
algorithms are discussed in greater detail in Section 6.4.


6.4 Algorithms of Levinson and Levinson-Durbin 


Since the correlation matrix of a stationary stochastic process is Toeplitz, we can exploit its special structure to
develop efficient, order-recursive algorithms for the linear system solution, matrix triangularization, and matrix 
inversion. Although we develop such algorithms in the context of optimum FIR filtering and prediction, the results 
apply to other applications involving Toeplitz matrices (Golub and van Loan 1996). 

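Suppose that we know the optimum filter $\mathbf{c}_m$, given by

$$\mathbf{c}_m = \mathbf{R}^{-1}_m\mathbf{d}_m \tag{6.4.1}$$

and we wish to use it to compute the optimum filter $\mathbf{c}_{m+1}$

$$\mathbf{c}_{m+1} = \mathbf{R}^{-1}_{m+1}\mathbf{d}_{m+1} \tag{6.4.2}$$

We first notice that the matrix $\mathbf{R}_{m+1}$ and the vector $\mathbf{d}_{m+1}$ can be partitioned as follows

$$\mathbf{R}_{m+1} = \begin{bmatrix} r(0) & \cdots & r(m-1) & r(m) \\ \vdots & \ddots & \vdots & \vdots \\ r^*(m-1) & \cdots & r(0) & r(1) \\ r^*(m) & \cdots & r^*(1) & r(0) \end{bmatrix} = \begin{bmatrix} \mathbf{R}_m & \mathbf{J}\mathbf{r}_m \\ \mathbf{r}^H_m\mathbf{J} & r(0) \end{bmatrix} \tag{6.4.3}$$

$$\mathbf{d}_{m+1} = \begin{bmatrix} \mathbf{d}_m \\ d_{m+1} \end{bmatrix} \tag{6.4.4}$$

which shows that both quantities have the optimum nesting property, that is, $\mathbf{R}^{\lceil m \rceil}_{m+1} = \mathbf{R}_m$ and $\mathbf{d}^{\lceil m \rceil}_{m+1} = \mathbf{d}_m$.
Using the matrix inversion by partitioning lemma (6.1.24), we obtain

$$\mathbf{R}^{-1}_{m+1} = \begin{bmatrix} \mathbf{R}^{-1}_m & \mathbf{0}_m \\ \mathbf{0}^H_m & 0 \end{bmatrix} + \frac{1}{P^b_m}\begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}^H_m & 1 \end{bmatrix} \tag{6.4.5}$$

where
$$\mathbf{b}_m = -\mathbf{R}^{-1}_m\mathbf{J}\mathbf{r}_m \tag{6.4.6}$$
and
$$P^b_m = r(0) + \mathbf{r}^H_m\mathbf{J}\mathbf{b}_m \tag{6.4.7}$$

Substitution of (6.4.4) and (6.4.5) into (6.4.2) gives

$$\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k^c_m \tag{6.4.8}$$

where
$$k^c_m \triangleq \frac{\beta^c_m}{P^b_m} \tag{6.4.9}$$
and
$$\beta^c_m \triangleq \mathbf{b}^H_m\mathbf{d}_m + d_{m+1} = -\mathbf{c}^T_m\mathbf{J}\mathbf{r}^*_m + d_{m+1} \tag{6.4.10}$$

Equations (6.4.8) through (6.4.10) constitute a Levinson recursion for the optimum filter and have been obtained
without making use of the Toeplitz structure of $\mathbf{R}_{m+1}$.
The development of a complete order-recursive algorithm is made possible by exploiting the Toeplitz structure.
Indeed, when the correlation matrix $\mathbf{R}_m$ is Toeplitz, we have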


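$$\mathbf{b}_m = \mathbf{J}\mathbf{a}^*_m \tag{6.4.11}$$

and
$$P_m \triangleq P^b_m = P^f_m \tag{6.4.12}$$

as we recall from Section 5.5. Since we can determine $\mathbf{b}_m$ from $\mathbf{a}_m$, we need to perform only one Levinson recursion,
either for $\mathbf{b}_m$ or for $\mathbf{a}_m$.

To avoid the use of the lower right corner partitioning, we develop an order recursion for the FLP $\mathbf{a}_m$. Indeed,
to compute $\mathbf{a}_{m+1}$ from $\mathbf{a}_m$, recall that

$$\mathbf{a}_{m+1} = -\mathbf{R}^{-1}_{m+1}\mathbf{r}^*_{m+1} \tag{6.4.13}$$

which, when combined with (6.4.5) and

$$\mathbf{r}_{m+1} = \begin{bmatrix} \mathbf{r}_m \\ r(m+1) \end{bmatrix} \tag{6.4.14}$$

leads to the Levinson recursion

$$\mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m \tag{6.4.15}$$

where
$$k_m \triangleq -\frac{\beta_m}{P_m} \tag{6.4.16}$$
$$\beta_m \triangleq \mathbf{b}^H_m\mathbf{r}^*_m + r^*(m+1) = \mathbf{a}^T_m\mathbf{J}\mathbf{r}^*_m + r^*(m+1) \tag{6.4.17}$$
and
$$P_m = r(0) + \mathbf{r}^H_m\mathbf{a}^*_m = r(0) + \mathbf{a}^H_m\mathbf{r}^*_m \tag{6.4.18}$$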
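Also, using (6.1.46) and (6.2.6), we can show that

$$\det\mathbf{R}_m = \prod_{i=0}^{m-1} P_i \quad\text{with}\quad P_0 = r(0) \tag{6.4.19}$$

which emphasizes the importance of $P_m$ for the invertibility of the autocorrelation matrix. The MMSE $P_m$ for
either the forward or the backward predictor of order $m$ can be computed recursively as follows:

$$P_{m+1} = r(0) + \begin{bmatrix} \mathbf{r}^H_m & r^*(m+1) \end{bmatrix}\left\{ \begin{bmatrix} \mathbf{a}^*_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}^*_m \\ 1 \end{bmatrix} k^*_m \right\} = r(0) + \mathbf{r}^H_m\mathbf{a}^*_m + [\mathbf{r}^H_m\mathbf{b}^*_m + r^*(m+1)]\,k^*_m \tag{6.4.20}$$

or
$$P_{m+1} = P_m + \beta_m k^*_m = P_m + \beta^*_m k_m \tag{6.4.21}$$

The following recursive formula for the computation of the MMSE

$$P^c_{m+1} = P^c_m - \beta^c_m k^{c*}_m = P^c_m - \beta^{c*}_m k^c_m \tag{6.4.22}$$

can be found by using (6.4.8).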

Therefore, the algorithm of Levinson consists of two parts: a set of recursions that compute the optimum FLP or
BLP and a set of recursions that use this information to compute the optimum filter. The part that computes the linear
predictors is known as the Levinson-Durbin algorithm and was pointed out by Durbin (1960). From a linear system
solution point of view, the algorithm of Levinson solves a Hermitian Toeplitz system with arbitrary right-hand side
vector $\mathbf{d}$; the Levinson-Durbin algorithm deals with the special case $\mathbf{d} = \mathbf{r}^*$ or $\mathbf{J}\mathbf{r}$.


Algorithm of Levinson-Durbin 


The algorithm of Levinson-Durbin, which takes as input the autocorrelation sequence $r(0), r(1), \ldots, r(M)$
and computes the quantities $\mathbf{a}_m$, $P_m$, and $k_{m-1}$ for $m = 1, 2, \ldots, M$, is illustrated in the following examples.

EXAMPLE 6.4.1. Determine the FLP $\mathbf{a}_2 = [a^{(2)}_1\ a^{(2)}_2]^T$ and the MMSE $P_2$ from the autocorrelation values $r(0)$, $r(1)$, and
$r(2)$.
Solution. To initialize the algorithm, we determine the first-order predictor by solving the normal equations $r(0)a^{(1)}_1 = -r^*(1)$. Indeed,
we have

$$a^{(1)}_1 = -\frac{r^*(1)}{r(0)} = k_0 = -\frac{\beta_0}{P_0}$$

which implies that
$$\beta_0 = r^*(1) \qquad P_0 = r(0)$$

To update to order 2, we need $k_1$ and hence $\beta_1$ and $P_1$, which can be obtained by

$$\beta_1 = a^{(1)}_1 r^*(1) + r^*(2) = \frac{r(0)r^*(2) - [r^*(1)]^2}{r(0)}$$
$$P_1 = P_0 + \beta_0 k^*_0 = \frac{r^2(0) - |r(1)|^2}{r(0)}$$

as
$$k_1 = \frac{[r^*(1)]^2 - r(0)r^*(2)}{r^2(0) - |r(1)|^2}$$

Therefore, using Levinson's recursion, we obtain

$$a^{(2)}_2 = k_1 = \frac{[r^*(1)]^2 - r(0)r^*(2)}{r^2(0) - |r(1)|^2}$$

and
$$a^{(2)}_1 = a^{(1)}_1 + a^{(1)*}_1 k_1$$

which agree with the results obtained in Example 5.5.1. The resulting MMSE can be found by using $P_2 = P_1 + \beta_1 k^*_1$.
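EXAMPLE 6.4.2. Use the Levinson-Durbin algorithm to compute the third-order forward predictor for a signal $x(n)$ with
autocorrelation sequence $r(0) = 3$, $r(1) = 2$, $r(2) = 1$, and $r(3) = \tfrac{1}{2}$.

Solution. To initialize the algorithm, we notice that the first-order predictor is given by $r(0)a^{(1)}_1 = -r(1)$ and that for
$m = 0$, (6.4.15) gives $a^{(1)}_1 = k_0$. Hence, we have

$$a^{(1)}_1 = -\frac{r(1)}{r(0)} = -\frac{2}{3} = k_0 = -\frac{\beta_0}{P_0}$$

which implies
$$P_0 = r(0) = 3 \qquad \beta_0 = r(1) = 2$$

To compute $\mathbf{a}_2$ by (6.4.15), we need $a^{(1)}_1$, $b^{(1)}_1 = a^{(1)}_1$, and $k_1 = -\beta_1/P_1$. From (6.4.21), we have

$$P_1 = P_0 + \beta_0 k_0 = 3 + 2\left(-\frac{2}{3}\right) = \frac{5}{3}$$

and from (6.4.17)

$$\beta_1 = \mathbf{r}^T_1\mathbf{J}\mathbf{a}_1 + r(2) = 2\left(-\frac{2}{3}\right) + 1 = -\frac{1}{3}$$

Hence,
$$k_1 = -\frac{\beta_1}{P_1} = -\frac{-1/3}{5/3} = \frac{1}{5}$$

and
$$\mathbf{a}_2 = \begin{bmatrix} -\tfrac{2}{3} \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{2}{3} \\ 1 \end{bmatrix}\frac{1}{5} = \begin{bmatrix} -\tfrac{4}{5} \\ \tfrac{1}{5} \end{bmatrix}$$

Continuing in the same manner, we obtain

$$P_2 = P_1 + \beta_1 k_1 = \frac{5}{3} + \left(-\frac{1}{3}\right)\frac{1}{5} = \frac{8}{5}$$
$$\beta_2 = \mathbf{r}^T_2\mathbf{J}\mathbf{a}_2 + r(3) = \begin{bmatrix} 2 & 1 \end{bmatrix}\begin{bmatrix} \tfrac{1}{5} \\ -\tfrac{4}{5} \end{bmatrix} + \frac{1}{2} = \frac{1}{10}$$
$$k_2 = -\frac{\beta_2}{P_2} = -\frac{1/10}{8/5} = -\frac{1}{16}$$
$$\mathbf{a}_3 = \begin{bmatrix} -\tfrac{4}{5} \\ \tfrac{1}{5} \\ 0 \end{bmatrix} + \begin{bmatrix} \tfrac{1}{5} \\ -\tfrac{4}{5} \\ 1 \end{bmatrix}\left(-\frac{1}{16}\right) = \begin{bmatrix} -\tfrac{13}{16} \\ \tfrac{1}{4} \\ -\tfrac{1}{16} \end{bmatrix}$$
$$P_3 = P_2 + \beta_2 k_2 = \frac{8}{5} + \frac{1}{10}\left(-\frac{1}{16}\right) = \frac{51}{32}$$

The algorithm of Levinson-Durbin, summarized in Table 6.2, requires $M^2$ operations and is implemented by the function
[a,k,Po]=durbin(r,M).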


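TABLE 6.2.
Summary of the Levinson-Durbin algorithm.

1. Input: $r(0), r(1), r(2), \ldots, r(M)$
2. Initialization
   (a) $P_0 = r(0)$, $\beta_0 = r^*(1)$
   (b) $k_0 = -r^*(1)/r(0)$, $a^{(1)}_1 = k_0$
3. For $m = 1, 2, \ldots, M-1$
   (a) $P_m = P_{m-1} + \beta_{m-1}k^*_{m-1}$
   (b) $\mathbf{r}_m = [r(1)\ r(2)\ \cdots\ r(m)]^T$
   (c) $\beta_m = \mathbf{a}^T_m\mathbf{J}\mathbf{r}^*_m + r^*(m+1)$
   (d) $k_m = -\beta_m/P_m$
   (e) $\mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J}\mathbf{a}^*_m \\ 1 \end{bmatrix} k_m$
4. $P_M = P_{M-1} + \beta_{M-1}k^*_{M-1}$
5. Output: $\mathbf{a}_M$, $\{k_m\}_0^{M-1}$, $\{P_m\}_1^{M}$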
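
The Levinson-Durbin recursions (6.4.15) through (6.4.17) and (6.4.21) can be sketched in a few lines of code. The Python function below is only an illustrative sketch and is not the durbin function mentioned in the text: the name levinson_durbin, its argument layout, and the use of exact Fraction arithmetic are assumptions made for this example. It is run on the data of Example 6.4.2.

```python
from fractions import Fraction

def levinson_durbin(r):
    """Illustrative Levinson-Durbin recursion (hypothetical helper).

    r: list [r(0), r(1), ..., r(M)].
    Returns (a, k, P) with a = a_M, k = [k_0..k_{M-1}], P = [P_0..P_M].
    """
    M = len(r) - 1
    a, k, P = [], [], [r[0]]
    for m in range(M):
        # beta_m = a_m^T J r_m^* + r^*(m+1), as in (6.4.17)
        beta = r[m + 1].conjugate() + sum(
            a[i] * r[m - i].conjugate() for i in range(m))
        km = -beta / P[m]                        # (6.4.16)
        # a_{m+1} = [a_m; 0] + [J a_m^*; 1] k_m, as in (6.4.15)
        a = [a[i] + a[m - 1 - i].conjugate() * km for i in range(m)] + [km]
        k.append(km)
        P.append(P[m] + beta * km.conjugate())   # (6.4.21)
    return a, k, P

# Data of Example 6.4.2: r(0)=3, r(1)=2, r(2)=1, r(3)=1/2
r = [Fraction(3), Fraction(2), Fraction(1), Fraction(1, 2)]
a, k, P = levinson_durbin(r)
print(a)  # a_3 = [-13/16, 1/4, -1/16]
print(P)  # P_0..P_3 = [3, 5/3, 8/5, 51/32]
```

Exact rational arithmetic is used here only so that the result can be compared digit for digit with the worked example; floating-point or complex inputs go through the same code path.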


Algorithm of Levinson 
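The next example illustrates the algorithm of Levinson that can be used to solve a system of linear equations
with a Hermitian Toeplitz matrix and arbitrary right-hand side vector.

EXAMPLE 6.4.3. Consider an optimum filter with input $x(n)$ and desired response $y(n)$. The autocorrelation of the input signal
is $r(0) = 3$, $r(1) = 2$, and $r(2) = 1$. The cross-correlation between the desired response and input is $d_1 = 1$, $d_2 = 2$, and
$d_3 = \tfrac{5}{2}$; and the power of $y(n)$ is $P_y = 3$. Design a third-order optimum FIR filter, using the algorithm of Levinson.

Solution. We start initializing the algorithm by noticing that for $m = 0$ we have $r(0)a^{(1)}_1 = -r(1)$, which gives

$$a^{(1)}_1 = k_0 = -\frac{r(1)}{r(0)} = -\frac{2}{3}$$
$$P_0 = r(0) = 3 \qquad \beta_0 = r(1) = 2$$
and
$$P_1 = P_0 + \beta_0 k_0 = 3 + 2\left(-\frac{2}{3}\right) = \frac{5}{3}$$

Next, we compute the Levinson recursion for the first-order optimum filter

$$P^c_0 = P_y = 3 \qquad \beta^c_0 = d_1 = 1$$
$$k^c_0 = c^{(1)}_1 = \frac{\beta^c_0}{P_0} = \frac{1}{3}$$
$$P^c_1 = P^c_0 - \beta^c_0 k^c_0 = 3 - 1\left(\frac{1}{3}\right) = \frac{8}{3}$$

Then we carry the Levinson recursion for $m = 1$ to obtain

$$\beta_1 = \mathbf{r}^T_1\mathbf{J}\mathbf{a}_1 + r(2) = 2\left(-\frac{2}{3}\right) + 1 = -\frac{1}{3}$$
$$k_1 = -\frac{\beta_1}{P_1} = -\frac{-1/3}{5/3} = \frac{1}{5}$$
$$\mathbf{a}_2 = \begin{bmatrix} -\tfrac{2}{3} \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{2}{3} \\ 1 \end{bmatrix}\frac{1}{5} = \begin{bmatrix} -\tfrac{4}{5} \\ \tfrac{1}{5} \end{bmatrix}$$
$$P_2 = P_1 + \beta_1 k_1 = \frac{5}{3} + \left(-\frac{1}{3}\right)\left(\frac{1}{5}\right) = \frac{8}{5}$$

for the optimum predictor, and

$$\beta^c_1 = \mathbf{a}^T_1\mathbf{J}\mathbf{d}_1 + d_2 = -\frac{2}{3}(1) + 2 = \frac{4}{3}$$
$$k^c_1 = \frac{\beta^c_1}{P_1} = \frac{4/3}{5/3} = \frac{4}{5}$$
$$\mathbf{c}_2 = \begin{bmatrix} \tfrac{1}{3} \\ 0 \end{bmatrix} + \begin{bmatrix} -\tfrac{2}{3} \\ 1 \end{bmatrix}\frac{4}{5} = \begin{bmatrix} -\tfrac{1}{5} \\ \tfrac{4}{5} \end{bmatrix}$$
$$P^c_2 = P^c_1 - \beta^c_1 k^c_1 = \frac{8}{3} - \frac{4}{3}\left(\frac{4}{5}\right) = \frac{8}{5}$$

for the optimum filter. The last recursion ($m = 2$) is carried out only for the optimum filter and gives

$$\beta^c_2 = \mathbf{a}^T_2\mathbf{J}\mathbf{d}_2 + d_3 = \begin{bmatrix} -\tfrac{4}{5} & \tfrac{1}{5} \end{bmatrix}\begin{bmatrix} 2 \\ 1 \end{bmatrix} + \frac{5}{2} = \frac{11}{10}$$
$$k^c_2 = \frac{\beta^c_2}{P_2} = \frac{11/10}{8/5} = \frac{11}{16}$$
$$\mathbf{c}_3 = \begin{bmatrix} -\tfrac{1}{5} \\ \tfrac{4}{5} \\ 0 \end{bmatrix} + \begin{bmatrix} \tfrac{1}{5} \\ -\tfrac{4}{5} \\ 1 \end{bmatrix}\frac{11}{16} = \begin{bmatrix} -\tfrac{1}{16} \\ \tfrac{1}{4} \\ \tfrac{11}{16} \end{bmatrix}$$
$$P^c_3 = P^c_2 - \beta^c_2 k^c_2 = \frac{8}{5} - \frac{11}{10}\left(\frac{11}{16}\right) = \frac{27}{32}$$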
The algorithm of Levinson, summarized in Table 6.3, is implemented by the MATLAB function
[c,k,kc,Pc]=levins(R,d,Py,M) and requires $2M^2$ operations because it involves two dot products and
two scalar-vector multiplications. A parallel processing implementation of the algorithm is not possible because the
dot products involve additions that cannot be executed simultaneously. Notice that adding $M = 2^q$ numbers using
$M/2$ adders requires $q = \log_2 M$ steps.
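
The two coupled recursions of the algorithm of Levinson, the predictor update (6.4.15) through (6.4.21) and the filter update (6.4.8) through (6.4.10) with the MMSE update (6.4.22), can be sketched as follows. This Python function is an illustrative sketch only and is not the levins function named in the text; the name levinson, the argument layout, and the exact Fraction arithmetic are assumptions of this example. The data are those of Example 6.4.3.

```python
from fractions import Fraction

def levinson(r, d, Py):
    """Illustrative algorithm of Levinson (hypothetical helper).

    r: [r(0), ..., r(M-1)], d: [d_1, ..., d_M], Py: power of y(n).
    Returns (c, kc, Pc): optimum filter c_M, ladder coefficients
    [k^c_0..k^c_{M-1}], and the MMSE P^c_M.
    """
    M = len(d)
    a, c, kc = [], [], []
    P, Pc = r[0], Py
    for m in range(M):
        b = [v.conjugate() for v in reversed(a)]      # b_m = J a_m^*
        # beta^c_m = b_m^H d_m + d_{m+1}, as in (6.4.10)
        beta_c = d[m] + sum(b[i].conjugate() * d[i] for i in range(m))
        kcm = beta_c / P                              # (6.4.9)
        c = [c[i] + b[i] * kcm for i in range(m)] + [kcm]   # (6.4.8)
        Pc = Pc - beta_c * kcm.conjugate()            # (6.4.22)
        kc.append(kcm)
        if m + 1 < M:
            # Predictor update needed for the next order.
            beta = r[m + 1].conjugate() + sum(
                b[i].conjugate() * r[i + 1].conjugate() for i in range(m))
            km = -beta / P                            # (6.4.16)
            a = [a[i] + b[i] * km for i in range(m)] + [km]  # (6.4.15)
            P = P + beta * km.conjugate()             # (6.4.21)
    return c, kc, Pc

r = [Fraction(3), Fraction(2), Fraction(1)]
d = [Fraction(1), Fraction(2), Fraction(5, 2)]
c, kc, Pc = levinson(r, d, Fraction(3))
print(c, kc, Pc)
```

For these data the sketch gives c_3 = [-1/16, 1/4, 11/16], ladder coefficients 1/3, 4/5, 11/16, and MMSE 27/32; substituting c_3 into the normal equations reproduces d_1 = 1, d_2 = 2, d_3 = 5/2.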
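TABLE 6.3.
Summary of the algorithm of Levinson.

1. Input: $\{r(l)\}_0^{M}$, $\{d_m\}_1^{M}$, $P_y$
2. Initialization
   (a) $P_0 = r(0)$, $\beta_0 = r^*(1)$, $P^c_0 = P_y$
   (b) $k_0 = -\beta_0/P_0$, $a^{(1)}_1 = k_0$
   (c) $\beta^c_0 = d_1$
   (d) $k^c_0 = \beta^c_0/P_0$, $c^{(1)}_1 = k^c_0$
   (e) $P^c_1 = P^c_0 - \beta^c_0 k^{c*}_0$
3. For $m = 1, 2, \ldots, M-1$
   (a) $\mathbf{r}_m = [r(1)\ r(2)\ \cdots\ r(m)]^T$
   (b) $\beta_m = \mathbf{a}^T_m\mathbf{J}\mathbf{r}^*_m + r^*(m+1)$
   (c) $P_m = P_{m-1} + \beta_{m-1}k^*_{m-1}$
   (d) $k_m = -\beta_m/P_m$
   (e) $\mathbf{a}_{m+1} = \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J}\mathbf{a}^*_m \\ 1 \end{bmatrix} k_m$
   (f) $\beta^c_m = -\mathbf{c}^T_m\mathbf{J}\mathbf{r}^*_m + d_{m+1}$
   (g) $k^c_m = \beta^c_m/P_m$
   (h) $\mathbf{c}_{m+1} = \begin{bmatrix} \mathbf{c}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J}\mathbf{a}^*_m \\ 1 \end{bmatrix} k^c_m$
   (i) $P^c_{m+1} = P^c_m - \beta^c_m k^{c*}_m$
4. Output: $\mathbf{a}_M$, $\mathbf{c}_M$, $\{k_m, k^c_m\}_0^{M-1}$, $\{P_m, P^c_m\}_0^{M}$

Minimum phase and autocorrelation extension
Using (6.4.16), we can also express the recursion (6.4.21) as

$$P_{m+1} = P_m(1 - |k_m|^2) = P_m - \frac{|\beta_m|^2}{P_m} \tag{6.4.23}$$

which, since $P_m \ge 0$, implies that
$$P_{m+1} \le P_m \tag{6.4.24}$$
and since the matrix $\mathbf{R}_m$ is positive definite, then $P_m > 0$ and (6.4.23) implies that
$$|k_m| \le 1 \tag{6.4.25}$$
for all $1 \le m < M$. If
$$P_0 > P_1 > \cdots > P_{M-1} > P_M = 0 \tag{6.4.26}$$

then the process $x(n)$ is predictable and (6.4.23) implies that

$$k_M = \pm 1 \quad\text{and}\quad |k_k| < 1 \qquad 1 \le k < M \tag{6.4.27}$$
(see Section 5.6.4). Also if
$$P_{M-1} > P_M = \cdots = P_\infty = P > 0 \tag{6.4.28}$$
from (6.4.23) we have
$$k_m = 0 \quad\text{for } m > M \tag{6.4.29}$$

which implies that the process $x(n)$ is AR($M$) and $e^f_M(n) \sim \text{WN}(0, P_M)$ (see Section 3.2.3). Finally, we note that
since the sequence $P_0, P_1, P_2, \ldots$ is nonincreasing, its limit as $m \to \infty$ exists and is nonnegative. A regular
process must satisfy $|k_m| < 1$ for all $m$, because $|k_m| = 1$ implies that $P_m = 0$, which contradicts the regularity
assumption.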

For $m = 0$, (6.4.19) gives $P_0 = r(0)$. Carrying out (6.4.23) from $m = 0$ to $m = M$, we obtain

$$P_M = r(0)\prod_{m=1}^{M}\left(1 - |k_{m-1}|^2\right) \tag{6.4.30}$$

which converges, as $M \to \infty$, if $|k_m| < 1$.
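
As a quick numerical check of the product formula (6.4.30), consider the data of Example 6.4.2, for which $r(0) = 3$ and the Levinson-Durbin algorithm gives $k_0 = -\tfrac{2}{3}$, $k_1 = \tfrac{1}{5}$, and $k_2 = -\tfrac{1}{16}$:

$$P_3 = 3\left(1 - \frac{4}{9}\right)\left(1 - \frac{1}{25}\right)\left(1 - \frac{1}{256}\right) = 3\cdot\frac{5}{9}\cdot\frac{24}{25}\cdot\frac{255}{256} = \frac{51}{32}$$

The partial products $\tfrac{5}{3}$ and $\tfrac{8}{5}$ are the intermediate errors $P_1$ and $P_2$, and the final value equals the $P_3$ obtained by carrying out the recursion (6.4.21) step by step.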


6.5 Lattice Structures for Optimum FIR Filters and Predictors


To compute the forward prediction error of an FLP of order $m$, we use the formula

$$e^f_m(n) = x(n) + \mathbf{a}^H_m\mathbf{x}_m(n-1) = x(n) + \sum_{k=1}^{m} a^{(m)*}_k x(n-k) \tag{6.5.1}$$

Similarly, for the BLP we have

$$e^b_m(n) = x(n-m) + \mathbf{b}^H_m\mathbf{x}_m(n) = x(n-m) + \sum_{k=0}^{m-1} b^{(m)*}_k x(n-k) \tag{6.5.2}$$

Both filters can be implemented using the direct-form filter structure shown in Figure 6.4. Since $\mathbf{a}_m$ and $\mathbf{b}_m$ do not
have the optimum nesting property, we cannot obtain order-recursive direct-form structures for the computation of
the prediction errors. However, next we show that we can derive an order-recursive lattice-ladder structure for the
implementation of optimum predictors and filters using the algorithm of Levinson.

FIGURE 6.4
Direct-form structure for the computation of the $m$th-order forward and backward prediction errors.

6.5.1 Lattice-Ladder Structures
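We note that the data vector for the $(m+1)$st-order predictor can be partitioned in the following ways:

$$\mathbf{x}_{m+1}(n) = [x(n)\ x(n-1)\ \cdots\ x(n-m+1)\ x(n-m)]^T$$
$$\qquad = [\mathbf{x}^T_m(n)\ \ x(n-m)]^T \tag{6.5.3}$$
$$\qquad = [x(n)\ \ \mathbf{x}^T_m(n-1)]^T \tag{6.5.4}$$

Using (6.5.1), (6.5.3), (6.4.15), and (6.5.2), we obtain

$$e^f_{m+1}(n) = x(n) + \left\{ \begin{bmatrix} \mathbf{a}_m \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_m \\ 1 \end{bmatrix} k_m \right\}^H \begin{bmatrix} \mathbf{x}_m(n-1) \\ x(n-m-1) \end{bmatrix}$$
$$\qquad = x(n) + \mathbf{a}^H_m\mathbf{x}_m(n-1) + k^*_m[\mathbf{b}^H_m\mathbf{x}_m(n-1) + x(n-1-m)]$$

or
$$e^f_{m+1}(n) = e^f_m(n) + k^*_m e^b_m(n-1) \tag{6.5.5}$$

Using (6.4.11) and (6.4.15), we obtain the following Levinson-type recursion for the backward predictor:

$$\mathbf{b}_{m+1} = \begin{bmatrix} 0 \\ \mathbf{b}_m \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} k^*_m$$

The backward prediction error is

$$e^b_{m+1}(n) = x(n-m-1) + \left\{ \begin{bmatrix} 0 \\ \mathbf{b}_m \end{bmatrix} + \begin{bmatrix} 1 \\ \mathbf{a}_m \end{bmatrix} k^*_m \right\}^H \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1) \end{bmatrix}$$
$$\qquad = x(n-m-1) + \mathbf{b}^H_m\mathbf{x}_m(n-1) + k_m[x(n) + \mathbf{a}^H_m\mathbf{x}_m(n-1)]$$

or
$$e^b_{m+1}(n) = e^b_m(n-1) + k_m e^f_m(n) \tag{6.5.6}$$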


Recursions (6.5.5) and (6.5.6) can be computed for $m = 0, 1, \ldots, M-1$. The initial conditions $e^f_0(n)$ and $e^b_0(n)$
are easily obtained from (6.5.1) and (6.5.2). The recursions also lead to the following all-zero lattice algorithm

$$\begin{aligned}
e^f_0(n) &= e^b_0(n) = x(n) \\
e^f_m(n) &= e^f_{m-1}(n) + k^*_{m-1}e^b_{m-1}(n-1) \qquad m = 1, 2, \ldots, M \\
e^b_m(n) &= k_{m-1}e^f_{m-1}(n) + e^b_{m-1}(n-1) \qquad m = 1, 2, \ldots, M \\
e(n) &= e^f_M(n)
\end{aligned} \tag{6.5.7}$$

that is implemented using the structure shown in Figure 6.5. The lattice parameters $k_m$ are known as reflection
coefficients in the speech processing and geophysics areas.
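
The all-zero lattice algorithm (6.5.7) maps directly onto a loop over time with an inner loop over stages. The following Python sketch illustrates this; the function name lattice_allzero, the zero initial state, and the use of Fraction values are assumptions made for the example. Driving the lattice with a unit impulse returns the coefficients of the prediction error filter, here with the reflection coefficients found in Example 6.4.2.

```python
from fractions import Fraction

def lattice_allzero(x, k):
    """Illustrative all-zero lattice filter (hypothetical helper).

    x: input samples; k: [k_0, ..., k_{M-1}].
    Returns (ef, eb): the order-M forward and backward prediction
    errors, assuming a zero initial state.
    """
    M = len(k)
    eb_delayed = [0] * M            # eb_delayed[m] holds e^b_m(n-1)
    ef_out, eb_out = [], []
    for xn in x:
        ef = eb = xn                # e^f_0(n) = e^b_0(n) = x(n)
        for m in range(M):
            ef_next = ef + k[m].conjugate() * eb_delayed[m]
            eb_next = k[m] * ef + eb_delayed[m]
            eb_delayed[m] = eb      # keep e^b_m(n) for the next sample
            ef, eb = ef_next, eb_next
        ef_out.append(ef)
        eb_out.append(eb)
    return ef_out, eb_out

k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
ef, eb = lattice_allzero([1, 0, 0, 0, 0], k)
print(ef)  # impulse response 1, a_1, a_2, a_3, 0
print(eb)  # the same coefficients in reversed order
```

With real reflection coefficients the backward impulse response is the time-reversed forward one, which is the time-domain counterpart of the symmetry between the forward and backward predictors.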


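[Figure not reproduced: cascade of lattice stages 1 through M driven by x(n), with the forward prediction errors on the upper branch and the backward prediction errors on the lower branch.]

FIGURE 6.5
All-zero lattice structure for the implementation of the forward and backward prediction error filters.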


The Levinson recursion for the optimum filter, (6.4.8) through (6.4.10), adds a ladder part to the lattice structure
for the forward and backward predictors. Using (6.4.8), (6.5.7), and the partitioning in (6.5.3), we can express the
filtering error of order $m+1$ in terms of $e_m(n)$ and $e^b_m(n)$ as follows

$$e_{m+1}(n) = y(n) - \mathbf{c}^H_{m+1}\mathbf{x}_{m+1}(n) = e_m(n) - k^{c*}_m e^b_m(n) \tag{6.5.8}$$

for $m = 0, 1, \ldots, M-1$. The resulting lattice-ladder structure is similar to the one shown in Figure 6.3. However,
owing to stationarity all coefficients are constant, and $k^f_m(n) = k^{b*}_m(n) = k_m$. We note that the efficient solution of the
$M$th-order optimum filtering problem is derived from the solution of the $(M-1)$st-order forward and backward
prediction problems of the input process. In fact, the lattice part serves to decorrelate the samples
$x(n), x(n-1), \ldots, x(n-M)$, producing the uncorrelated samples $e^b_0(n), e^b_1(n), \ldots, e^b_M(n)$ (innovations), which
are then linearly combined ("recorrelated") to obtain the optimum estimate of the desired response.
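
A minimal sketch of the lattice-ladder computation follows: the lattice part produces the backward errors and the ladder part applies (6.5.8) stage by stage. The function name lattice_ladder_error, the zero initial state, and the Fraction test data are assumptions of this example; the coefficient values are the ones obtained with the algorithm of Levinson for the data of Example 6.4.3 (k_0 = -2/3, k_1 = 1/5 and ladder coefficients 1/3, 4/5, 11/16).

```python
from fractions import Fraction

def lattice_ladder_error(x, y, k, kc):
    """Illustrative lattice-ladder filter (hypothetical helper).

    x, y: input and desired-response samples.
    k: lattice coefficients [k_0..k_{M-2}];
    kc: ladder coefficients [k^c_0..k^c_{M-1}].
    Returns the order-M filtering error e_M(n), zero initial state.
    """
    M = len(kc)
    eb_delayed = [0] * (M - 1)      # eb_delayed[m] holds e^b_m(n-1)
    err = []
    for xn, yn in zip(x, y):
        ef = eb = xn                # e^f_0(n) = e^b_0(n) = x(n)
        e = yn                      # e_0(n) = y(n)
        for m in range(M):
            e = e - kc[m].conjugate() * eb          # ladder step (6.5.8)
            if m < M - 1:                           # lattice step (6.5.7)
                ef_next = ef + k[m].conjugate() * eb_delayed[m]
                eb_next = k[m] * ef + eb_delayed[m]
                eb_delayed[m] = eb
                ef, eb = ef_next, eb_next
        err.append(e)
    return err

k = [Fraction(-2, 3), Fraction(1, 5)]
kc = [Fraction(1, 3), Fraction(4, 5), Fraction(11, 16)]
# With y(n) = 0 and an impulse input, the error is minus the impulse
# response of the equivalent direct-form filter.
err = lattice_ladder_error([1, 0, 0, 0], [0, 0, 0, 0], k, kc)
print(err)
```

Because the desired response is set to zero in this check, the negated output samples are the direct-form coefficients -1/16, 1/4, 11/16 of the third-order optimum filter for these data.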

System functions. We next express the various lattice relations in terms of $z$-transforms. Taking the
$z$-transform of (6.5.1) and (6.5.2), we obtain

$$E^f_m(z) = \left[1 + \sum_{k=1}^{m} a^{(m)*}_k z^{-k}\right] X(z) \triangleq A_m(z)X(z) \tag{6.5.9}$$
$$E^b_m(z) = \left[z^{-m} + \sum_{k=0}^{m-1} b^{(m)*}_k z^{-k}\right] X(z) \triangleq B_m(z)X(z) \tag{6.5.10}$$

where $A_m(z)$ and $B_m(z)$ are the system functions of the paths from the input to the outputs of the $m$th stage of
the lattice. Using the symmetry relation $\mathbf{a}_m = \mathbf{J}\mathbf{b}^*_m$, $1 \le m \le M$, we obtain

$$B_m(z) = z^{-m}A^*_m\!\left(\frac{1}{z^*}\right) \tag{6.5.11}$$

Note that if $z_0$ is a zero of $A_m(z)$, then $1/z^*_0$ is a zero of $B_m(z)$. Therefore, if $A_m(z)$ is minimum-phase, then
$B_m(z)$ is maximum-phase.

Taking the $z$-transform of the lattice equations (6.5.7), we have for the $m$th stage

$$E^f_m(z) = E^f_{m-1}(z) + k^*_{m-1}z^{-1}E^b_{m-1}(z) \tag{6.5.12}$$
$$E^b_m(z) = k_{m-1}E^f_{m-1}(z) + z^{-1}E^b_{m-1}(z) \tag{6.5.13}$$

Dividing both equations by $X(z)$ and using (6.5.9) and (6.5.10), we have

$$A_m(z) = A_{m-1}(z) + k^*_{m-1}z^{-1}B_{m-1}(z) \tag{6.5.14}$$
$$B_m(z) = k_{m-1}A_{m-1}(z) + z^{-1}B_{m-1}(z) \tag{6.5.15}$$

which, when initialized with
$$A_0(z) = B_0(z) = 1 \tag{6.5.16}$$

describe the lattice filter in the $z$ domain.
The $z$-transform of the ladder part (6.5.8) is given by

$$E_{m+1}(z) = E_m(z) - k^{c*}_m E^b_m(z) \tag{6.5.17}$$

where $E_m(z)$ is the $z$-transform of the error sequence $e_m(n)$.
All-pole or "inverse" lattice structure. If we wish to recover the input $x(n)$ from the prediction error
$e(n) = e^f_M(n)$, we can use the following all-pole lattice filter algorithm

$$\begin{aligned}
e^f_M(n) &= e(n) \\
e^f_{m-1}(n) &= e^f_m(n) - k^*_{m-1}e^b_{m-1}(n-1) \qquad m = M, M-1, \ldots, 1 \\
e^b_m(n) &= e^b_{m-1}(n-1) + k_{m-1}e^f_{m-1}(n) \qquad m = M, M-1, \ldots, 1 \\
x(n) &= e^f_0(n) = e^b_0(n)
\end{aligned} \tag{6.5.18}$$

which is implemented by using the structure in Figure 6.6.

Although the system functions of the all-zero lattice in (6.5.7) and the all-pole lattice in (6.5.18) are
$H_{\text{az}}(z) = A(z)$ and $H_{\text{ap}}(z) = 1/A(z)$, the two lattice structures are described by the same set of lattice coefficients. The
difference is the signal flow (see feedback loops in the all-pole structure). This structure is used in speech processing
applications (Rabiner and Schafer 1978).
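
The all-pole algorithm (6.5.18) can be sketched in the same style as the all-zero lattice: the stages are visited from order M down to order 1, and the delayed backward errors are the only state. The function name lattice_allpole and the zero initial state are assumptions of this example. Feeding it the prediction error filter impulse response for the coefficients of Example 6.4.2 should return the unit impulse, since the two structures are inverses of each other.

```python
from fractions import Fraction

def lattice_allpole(e, k):
    """Illustrative all-pole (inverse) lattice filter (hypothetical helper).

    e: forward prediction error samples e(n) = e^f_M(n).
    k: [k_0, ..., k_{M-1}].
    Returns the recovered input x(n), assuming a zero initial state.
    """
    M = len(k)
    eb_delayed = [0] * M            # eb_delayed[m] holds e^b_m(n-1)
    x = []
    for en in e:
        ef = en                     # e^f_M(n) = e(n)
        for m in range(M, 0, -1):
            # e^f_{m-1}(n) = e^f_m(n) - k^*_{m-1} e^b_{m-1}(n-1)
            ef = ef - k[m - 1].conjugate() * eb_delayed[m - 1]
            if m < M:
                # e^b_m(n) = e^b_{m-1}(n-1) + k_{m-1} e^f_{m-1}(n);
                # slot m was already consumed by stage m+1.
                eb_delayed[m] = eb_delayed[m - 1] + k[m - 1] * ef
        eb_delayed[0] = ef          # e^b_0(n) = x(n)
        x.append(ef)
    return x

k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
e = [Fraction(1), Fraction(-13, 16), Fraction(1, 4), Fraction(-1, 16), Fraction(0)]
x = lattice_allpole(e, k)
print(x)  # recovers the unit impulse
```

The in-place update of the delayed backward errors relies on the downward stage order; visiting the stages upward would overwrite values that are still needed.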


6.5.2 Some Properties and Interpretations 


Lattice filters have some important properties and interesting interpretations that make them a useful tool in optimum 
filtering and signal modeling. 


Optimal nesting. The all-zero lattice filter has an optimal nesting property when it is used for the implementation
of an FLP. Indeed, if we use the lattice parameters obtained via the algorithm of Levinson-Durbin, the all-zero lattice
filter driven by the signal $x(n)$ produces prediction errors $e^f_m(n)$ and $e^b_m(n)$ at the output of the $m$th stage for all
$1 \le m \le M$. This implies that we can increase the order of the filter by attaching additional stages without
destroying the optimality of the previous stages. In contrast, the direct-form filter structure implementation requires
the computation of the entire predictor for each stage. However, the nesting property does not hold for the all-pole
lattice filter because of the feedback path.


[Figure not reproduced: cascade of stages M down to 1, with input $e^f_M(n) = e(n)$ and output $x(n) = e^f_0(n)$.]

FIGURE 6.6
All-pole lattice structure for recovering the input signal from the forward prediction error.


Orthogonality. The backward prediction errors $e^b_m(n)$ for $0 \le m \le M$ are uncorrelated (see Section 6.2),
that is,

$$E\{e^b_m(n)e^{b*}_k(n)\} = \begin{cases} P_m & k = m \\ 0 & k \ne m \end{cases} \tag{6.5.19}$$

and constitute the innovations representation of the input samples $x(n), x(n-1), \ldots, x(n-m)$. We see that at a
given time instant $n$, the backward prediction errors for orders $m = 0, 1, 2, \ldots, M$ are uncorrelated and are part
of a nonstationary sequence because the variance $E\{|e^b_m(n)|^2\} = P_m$ depends on $m$. This should be expected
because, for a given $n$, each $e^b_m(n)$ is computed using a different set of predictor coefficients. In contrast, for a
given $m$, the sequence $e^b_m(n)$ is stationary for $-\infty < n < \infty$.

Reflection coefficients. The all-pole lattice structure is very useful in the modeling of layered media, where each
stage of the lattice models one layer or section of the medium. Traveling waves in geophysical layers, in acoustic
tubes of varying cross-sections, and in multisectional transmission lines have been modeled in this fashion. The
modeling is performed such that the wave travel time through each section is the same, but the sections may have
different impedances. The $m$th section is modeled with the signals $e^f_m(n)$ and $e^b_m(n)$ representing the forward
and backward traveling waves, respectively.

If $Z_m$ and $Z_{m-1}$ are the characteristic impedances at sections $m$ and $m-1$, respectively, then $k_m$ represents
the reflection coefficient between the two sections, given by

$$k_m = \frac{Z_m - Z_{m-1}}{Z_m + Z_{m-1}} \tag{6.5.20}$$

For this reason, the lattice parameters $k_m$ are often known as reflection coefficients. As reflection coefficients,
it makes good sense that their magnitudes not exceed unity. The termination of the lattice assumes a perfect reflection,
and so the reflected wave $e^b_0(n)$ is equal to the transmitted wave $e^f_0(n)$. The result of this specific termination is an
overall all-pole model (Rabiner and Schafer 1978).


Partial correlation coefficients. The partial correlation coefficient (PCC) between $x(n)$ and $x(n-m-1)$
(see also Section 6.2.2) is defined as the correlation coefficient between $e^f_m(n)$ and $e^b_m(n-1)$, that is,

$$\text{PCC}\{x(n-m-1);\, x(n)\} \triangleq \frac{\text{PARCOR}\{x(n-m-1);\, x(n)\}}{\sqrt{E\{|e^b_m(n-1)|^2\}\,E\{|e^f_m(n)|^2\}}} \tag{6.5.21}$$

and, therefore, it takes values in the range $[-1, 1]$ (Kendall and Stuart 1979).
Working as in Section 6.2, we can show that
$$E\{e^b_m(n-1)e^{f*}_m(n)\} = \mathbf{b}^H_m\mathbf{r}^*_m + r^*(m+1) = \beta_m \tag{6.5.22}$$
which in conjunction with
$$E\{|e^b_m(n-1)|^2\} = E\{|e^f_m(n)|^2\} = P_m \tag{6.5.23}$$
and (6.4.16), results in
$$k_m = -\frac{\beta_m}{P_m} = -\text{PCC}\{x(n-m-1);\, x(n)\} \tag{6.5.24}$$

That is, for stationary processes the lattice parameters are the negative of the partial autocorrelation sequence and
satisfy the relation

$$|k_m| \le 1 \quad\text{for all } 0 \le m \le M-1 \tag{6.5.25}$$

which was also derived in (6.4.25) using an alternate approach.
Minimum phase. The roots of the polynomial $A(z)$ are inside the unit circle if and only if

$$|k_m| < 1 \quad\text{for all } 0 \le m \le M-1 \tag{6.5.26}$$

which implies that the filters with system functions $A(z)$ and $1/A(z)$ are minimum-phase. The strict
inequalities (6.5.26) are satisfied if the stationary process $x(n)$ is nonpredictable, which is the case when the
Toeplitz autocorrelation matrix $\mathbf{R}$ is positive definite.


Lattice-ladder optimization. The output of an FIR lattice filter is a nonlinear function of the lattice parameters. 
Hence, if we try to design an optimum lattice filter by minimizing the MSE with respect to the lattice parameters, we 
end up with a nonlinear optimization problem (see Problem 6.11). In contrast, the Levinson algorithm leads to a 
lattice-ladder realization of the optimum filter through the order-recursive solution of a linear optimization problem. 
This subject is of interest to signal modeling and adaptive filtering (see Chapters 8 and 9). 


6.5.3 Parameter Conversions 


We have shown that the $M$th-order forward linear predictor of a stationary process $x(n)$ is uniquely specified
by a set of linear equations in terms of the autocorrelation sequence and the prediction error filter is minimum-phase.
Furthermore, it can be implemented using either a direct-form structure with coefficients $a^{(M)}_1, a^{(M)}_2, \ldots, a^{(M)}_M$ or
a lattice structure with parameters $k_0, k_1, \ldots, k_{M-1}$. Next we show how to convert between the following equivalent
representations of a linear predictor:
1. Direct-form filter structure: $\{P_M, a_1, a_2, \ldots, a_M\}$.
2. Lattice filter structure: $\{P_M, k_0, k_1, \ldots, k_{M-1}\}$.
3. Autocorrelation sequence: $\{r(0), r(1), \ldots, r(M)\}$.
The transformation between the above representations is performed using the algorithms shown in Figure 6.7.

[Figure not reproduced: diagram of the conversion algorithms between the three representations, including the step-up recursion.]

FIGURE 6.7
Equivalent representations for minimum-phase linear prediction error filters.


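Lattice-to-direct (step-up) recursion. Given the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and the MMSE error $P_M$,
we can compute the forward predictor $\mathbf{a}_M$ by using the following recursions

$$\mathbf{a}_m = \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{J}\mathbf{a}^*_{m-1} \\ 1 \end{bmatrix} k_{m-1} \tag{6.5.27}$$
$$P_m = P_{m-1}\left(1 - |k_{m-1}|^2\right) \tag{6.5.28}$$

for $m = 1, 2, \ldots, M$. This conversion is implemented by the function a=stepup(k).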
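
The step-up recursions (6.5.27) and (6.5.28) can be sketched as a single loop over the lattice parameters. The Python function below is an illustration only: its name step_up, the choice of passing the zeroth-order error P_0 = r(0) as the second argument, and the exact Fraction arithmetic are assumptions of this example and do not describe the interface of the function named in the text.

```python
from fractions import Fraction

def step_up(k, P0):
    """Illustrative lattice-to-direct conversion (hypothetical helper).

    k: [k_0, ..., k_{M-1}]; P0: zeroth-order error P_0 = r(0).
    Returns (a, P) with a = a_M and P = P_M.
    """
    a, P = [], P0
    for km in k:
        m = len(a)
        # a_{m+1} = [a_m; 0] + [J a_m^*; 1] k_m, as in (6.5.27)
        a = [a[i] + a[m - 1 - i].conjugate() * km for i in range(m)] + [km]
        P = P * (1 - abs(km) ** 2)              # (6.5.28)
    return a, P

k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
a, PM = step_up(k, Fraction(3))
print(a, PM)
```

With the reflection coefficients of Example 6.4.2 this reproduces the predictor [-13/16, 1/4, -1/16] and the error 51/32 found there.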


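Direct-to-lattice (step-down) recursion. Using the partitioning

$$\mathbf{a}_m = [a^{(m)}_1\ a^{(m)}_2\ \cdots\ a^{(m)}_m]^T = [\bar{\mathbf{a}}^T_m\ \ k_{m-1}]^T \tag{6.5.29}$$

we can write recursion (6.5.27) as
$$\bar{\mathbf{a}}_m = \mathbf{a}_{m-1} + \mathbf{J}\mathbf{a}^*_{m-1}k_{m-1}$$
or by taking the complex conjugate and multiplying both sides by $\mathbf{J}$
$$\mathbf{J}\bar{\mathbf{a}}^*_m = \mathbf{J}\mathbf{a}^*_{m-1} + \mathbf{a}_{m-1}k^*_{m-1}$$
Eliminating $\mathbf{J}\mathbf{a}^*_{m-1}$ from the last two equations and solving for $\mathbf{a}_{m-1}$, we obtain
$$\mathbf{a}_{m-1} = \frac{\bar{\mathbf{a}}_m - \mathbf{J}\bar{\mathbf{a}}^*_m k_{m-1}}{1 - |k_{m-1}|^2} \tag{6.5.30}$$
From (6.5.28), we have
$$P_{m-1} = \frac{P_m}{1 - |k_{m-1}|^2} \tag{6.5.31}$$

Given $\mathbf{a}_M$ and $P_M$, we can obtain $k_m$ and $P_m$ for $0 \le m \le M-1$ by computing the last two recursions for
$m = M, M-1, \ldots, 2$. We should stress that both recursions break down if $|k_m| = 1$. The step-down algorithm is
implemented by the function [k,P0]=stepdown(a,PM).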
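
The step-down recursions (6.5.30) and (6.5.31) can be sketched as follows. The Python function is an illustration only, not the stepdown function named in the text: the name step_down, the error raised when a reflection coefficient has unit magnitude, and the exact Fraction arithmetic are assumptions of this example. The loop is carried one step further than m = 2 so that it also returns P_0.

```python
from fractions import Fraction

def step_down(a, PM):
    """Illustrative direct-to-lattice conversion (hypothetical helper).

    a: predictor a_M = [a_1, ..., a_M]; PM: MMSE P_M.
    Returns (k, P0) with k = [k_0, ..., k_{M-1}] and P0 = P_0.
    """
    a, P, k = list(a), PM, []
    while a:
        km = a[-1]                          # k_{m-1} = a_m^{(m)}
        denom = 1 - abs(km) ** 2
        if denom == 0:
            raise ValueError("recursion breaks down when |k| = 1")
        abar = a[:-1]                       # first m-1 coefficients
        n = len(abar)
        # a_{m-1} = (abar - J abar^* k_{m-1}) / (1 - |k_{m-1}|^2), (6.5.30)
        a = [(abar[i] - abar[n - 1 - i].conjugate() * km) / denom
             for i in range(n)]
        P = P / denom                       # (6.5.31)
        k.append(km)
    k.reverse()
    return k, P

a3 = [Fraction(-13, 16), Fraction(1, 4), Fraction(-1, 16)]
k, P0 = step_down(a3, Fraction(51, 32))
print(k, P0)
```

Applied to the third-order predictor of Example 6.4.2, the sketch recovers the reflection coefficients -2/3, 1/5, -1/16 and P_0 = r(0) = 3, so it undoes the step-up recursion.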


EXAMPLE 6.5.1. Given the third-order FLP coefficients $a^{(3)}_1$, $a^{(3)}_2$, $a^{(3)}_3$, compute the lattice parameters $k_0$, $k_1$, $k_2$.
Solution. With the help of (6.5.29) the vector relation (6.5.30) can be written in scalar form as

$$k_{m-1} = a^{(m)}_m \tag{6.5.32}$$
and
$$a^{(m-1)}_i = \frac{a^{(m)}_i - a^{(m)*}_{m-i}k_{m-1}}{1 - |k_{m-1}|^2} \tag{6.5.33}$$

which can be used to implement the step-down algorithm for $m = M, M-1, \ldots, 2$ and $i = 1, 2, \ldots, m-1$. Starting with
$m = 3$ and $i = 1, 2$, we have

$$k_2 = a^{(3)}_3 \qquad a^{(2)}_1 = \frac{a^{(3)}_1 - a^{(3)*}_2 k_2}{1 - |k_2|^2} \qquad a^{(2)}_2 = \frac{a^{(3)}_2 - a^{(3)*}_1 k_2}{1 - |k_2|^2}$$

Similarly, for $m = 2$ and $i = 1$, we obtain

$$k_1 = a^{(2)}_2 \qquad a^{(1)}_1 = \frac{a^{(2)}_1 - a^{(2)*}_1 k_1}{1 - |k_1|^2} = k_0$$

which completes the solution.
The step-up and step-down recursions also can be expressed in polynomial form as

$$A_m(z) = A_{m-1}(z) + k^*_{m-1}z^{-m}A^*_{m-1}\!\left(\frac{1}{z^*}\right) \tag{6.5.34}$$
and
$$A_{m-1}(z) = \frac{A_m(z) - k^*_{m-1}z^{-m}A^*_m(1/z^*)}{1 - |k_{m-1}|^2} \tag{6.5.35}$$

respectively.
Lattice parameters to autocorrelation. If we know the lattice parameters $k_0, k_1, \ldots, k_{M-1}$ and $P_M$, we can
compute the values $r(0), r(1), \ldots, r(M)$ of the autocorrelation sequence using the formula

$$r(m+1) = -k^*_m P_m - \mathbf{a}^H_m\mathbf{J}\mathbf{r}_m \tag{6.5.36}$$

which is obtained by eliminating $\beta_m$ from (6.4.16) and (6.4.17), and is used in conjunction with (6.5.27) and (6.4.21)
for successive values of $m$. This algorithm is used by the function
r=k2r(k,PM).


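EXAMPLE 6.5.2. Given $P_0$, $k_0$, $k_1$, and $k_2$, compute the autocorrelation values $r(0)$, $r(1)$, $r(2)$, and $r(3)$.
Solution. Using $r(0) = P_0$ and
$$r(m+1) = -k^*_m P_m - \mathbf{a}^H_m\mathbf{J}\mathbf{r}_m$$
for $m = 0$, we have
$$r(1) = -k^*_0 P_0$$
For $m = 1$
$$r(2) = -k^*_1 P_1 - a^{(1)*}_1 r(1)$$
where
$$P_1 = P_0(1 - |k_0|^2)$$
Finally, for $m = 2$ we obtain
$$r(3) = -k^*_2 P_2 - [a^{(2)*}_1 r(2) + k^*_1 r(1)]$$
where
$$P_2 = P_1(1 - |k_1|^2)$$
and from the Levinson recursion
$$a^{(2)}_1 = a^{(1)}_1 + a^{(1)*}_1 k_1 = k_0 + k^*_0 k_1$$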
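
The conversion from lattice parameters to autocorrelation values combines (6.5.36) with the step-up recursion (6.5.27) and the error update (6.5.28). The Python sketch below is illustrative only and is not the k2r function named in the text: the name lattice_to_autocorr and the choice of starting from P_0 = r(0), as in Example 6.5.2, are assumptions of this example.

```python
from fractions import Fraction

def lattice_to_autocorr(k, P0):
    """Illustrative lattice-to-autocorrelation conversion (hypothetical helper).

    k: [k_0, ..., k_{M-1}]; P0: zeroth-order error P_0 = r(0).
    Returns [r(0), r(1), ..., r(M)].
    """
    r, a, P = [P0], [], P0
    for m, km in enumerate(k):
        # r(m+1) = -k_m^* P_m - a_m^H J r_m, as in (6.5.36)
        r.append(-km.conjugate() * P
                 - sum(a[i].conjugate() * r[m - i] for i in range(m)))
        # Step-up to order m+1 and update the prediction error.
        a = [a[i] + a[m - 1 - i].conjugate() * km for i in range(m)] + [km]
        P = P * (1 - abs(km) ** 2)
    return r

k = [Fraction(-2, 3), Fraction(1, 5), Fraction(-1, 16)]
r = lattice_to_autocorr(k, Fraction(3))
print(r)
```

Starting from the reflection coefficients of Example 6.4.2 and P_0 = 3, the sketch returns the autocorrelation values 3, 2, 1, 1/2 that were the input of that example.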


Direct parameters to autocorrelation. Given $\mathbf{a}_M$ and $P_M$, we can compute the autocorrelation sequence
$r(0), r(1), \ldots, r(M)$ by using (6.5.29) through (6.5.36). This method is known as the inverse Levinson algorithm
and is implemented by the function r=a2r(a,PM).


6.6 Summary 


The application of optimum FIR filters and linear combiners involves the following two steps.
• Design. In this step, we determine the optimum values of the estimator parameters by solving the normal
equations formed by using the known second-order moments. For stationary processes the design step is done
only once. For nonstationary processes, we repeat the design when the statistics change.
• Implementation. In this step, we use the optimum parameters and the input data to compute the optimum
estimate.

The type and complexity of the algorithms and structures available for the design and implementation of linear
MMSE estimators depend on two factors:
• The shift invariance of the input data vector.
• The stationarity of the signals that determine the second-order moments in the normal equations.

As we introduce more structure (shift invariance or stationarity), the algorithms and structures become simpler.
From a mathematical point of view, this is reflected in the structure of the correlation matrix, which starting from
general Hermitian at one end becomes Toeplitz at the other.

Linear combiners 

The input vector is not shift-invariant because the optimum estimate is computed by using samples from $M$
different signals. The correlation matrix $\mathbf{R}$ is Hermitian and usually positive definite. The normal equations are solved
by using the $\mathbf{LDL}^H$ decomposition, and the optimum estimate is computed by using the obtained parameters.
However, in many applications where we need the optimum estimate and not the coefficients of the optimum
combiner, we can implement the MMSE linear combiner, using the orthogonal order-recursive structure shown in
Figure 6.1. This structure consists of two parts: (1) a triangular decorrelator (orthogonalizer) that decorrelates the
input data vector and produces its innovations vector and (2) a linear combiner that combines the uncorrelated
innovations to compute the optimum estimates for all orders $1 \le m \le M$.


FIR filters and predictors 

In this case the input data vector is shift-invariant, which leads to simplifications, whose extent depends on the 
stationarity of the involved signals. 

Nonstationary case. In general, the correlation matrix is Hermitian and positive definite with no additional
structure, and the $\mathbf{LDL}^H$ decomposition is the recommended method to solve the normal equations. However, the
input shift invariance leads to a remarkable coupling between FLP, BLP, and FIR filtering, resulting in a simplified
orthogonal order-recursive structure, which now takes the form of a lattice-ladder filter (see Figure 6.3). The
backward prediction errors of all orders $1 \le m \le M$ provide the innovations of the input data vector. The
parameters of the lattice structure (decorrelator) are specified by the components of the $\mathbf{LDL}^H$ decomposition of the input
correlation matrix. The coefficients of the ladder part (correlator) depend on both the input correlation matrix and the
cross-correlation between the desired response and the input data vector.

Stationary case. In this case, the addition of stationarity to the shift invariance makes the correlation matrix 
Toeplitz. The presence of the Toeplitz structure has the following consequences: 

1. The development of efficient order-recursive algorithms, with computational complexity proportional to $M^2$, for
the solution of the normal equations and the triangularization of the correlation matrix.
   a. Levinson algorithm solves $\mathbf{R}\mathbf{c} = \mathbf{d}$ for arbitrary right-hand side vector $\mathbf{d}$ ($2M^2$ operations).
   b. Levinson-Durbin algorithm solves $\mathbf{R}\mathbf{a} = -\mathbf{r}^*$ when the right-hand side has special structure ($M^2$
operations).

2. The MMSE FLP, BLP, and FIR filters are time-invariant; that is, their coefficients (direct-form or
lattice-ladder structures) are constant and should be computed only once.
The algorithms for MMSE filtering and prediction of stationary processes are the simplest ones.

However, we can also develop efficient algorithms for nonstationary processes that have special structure. There
are two cases of interest:

• The Kalman filtering algorithm that can be used for processes generated by a state-space model with known
parameters.
• Algorithms for $\alpha$-stationary processes, that is, processes whose correlation matrix is close to Toeplitz, as measured
by a special distance known as the displacement rank (Morf et al. 1977).


Problems 
6.1 By first computing the matrix product
R e m 0,
m m roH 1 
nt al os 


and then the determinants of both sides, prove Equation (6.1.25). 
6.2 Prove the matrix inversion lemma for lower right corner partitioned matrices, which is described by Equations (6.1.26) and (6.1.28). 
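6.3 This problem generalizes the matrix inversion lemmas to nonsymmetric matrices.

(a) Show that if $\mathbf{R}^{-1}$ exists, the inverse of an upper left corner partitioned matrix is given by

$$\begin{bmatrix} \mathbf{R} & \mathbf{r} \\ \tilde{\mathbf{r}}^T & \sigma \end{bmatrix}^{-1} = \frac{1}{\alpha}\begin{bmatrix} \alpha\mathbf{R}^{-1} + \mathbf{w}\mathbf{v}^T & \mathbf{w} \\ \mathbf{v}^T & 1 \end{bmatrix}$$

where
$$\mathbf{R}\mathbf{w} \triangleq -\mathbf{r}$$
$$\mathbf{R}^T\mathbf{v} \triangleq -\tilde{\mathbf{r}}$$
$$\alpha \triangleq \sigma - \tilde{\mathbf{r}}^T\mathbf{R}^{-1}\mathbf{r} = \sigma + \mathbf{v}^T\mathbf{r} = \sigma + \tilde{\mathbf{r}}^T\mathbf{w}$$

(b) Show that if $\mathbf{R}^{-1}$ exists, the inverse of a lower right corner partitioned matrix is given by

$$\begin{bmatrix} \sigma & \tilde{\mathbf{r}}^T \\ \mathbf{r} & \mathbf{R} \end{bmatrix}^{-1} = \frac{1}{\alpha}\begin{bmatrix} 1 & \mathbf{v}^T \\ \mathbf{w} & \alpha\mathbf{R}^{-1} + \mathbf{w}\mathbf{v}^T \end{bmatrix}$$

where
$$\mathbf{R}\mathbf{w} \triangleq -\mathbf{r}$$
$$\mathbf{R}^T\mathbf{v} \triangleq -\tilde{\mathbf{r}}$$
$$\alpha \triangleq \sigma - \tilde{\mathbf{r}}^T\mathbf{R}^{-1}\mathbf{r} = \sigma + \mathbf{v}^T\mathbf{r} = \sigma + \tilde{\mathbf{r}}^T\mathbf{w}$$

(c) Check the validity of the lemmas in parts (a) and (b), using MATLAB.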
6.4 Develop an order-recursive algorithm to solve the linear system in Example 6.1.2, using the lower right corner partitioning lemma 
(6.1.26). 
6.5 In this problem we consider two different approaches for inversion of symmetric and positive definite matrices by constructing an 
arbitrary fourth-order positive definite correlation matrix R and comparing their computational complexities. 
(a) Given that the inverse of a lower (upper) triangular matrix is itself lower (upper) triangular, develop an algorithm for triangular 
matrix inversion. 
(b) Compute the inverse of R, using the algorithm in part (a) and Equation (6.1.58). 
(c) Build up the inverse of R, using the recursion (6.1.24). 
(d) Estimate the number of operations for each method as a function of order M, and check their validity for M =4, using 
MATLAB. 
6.6 Using the appropriate orthogonality principles and definitions, prove Equation (6.3.32). 
6.7 Prove Equations (6.3.36) to (6.3.38), using Equation (6.1.45). 
6.8 Working as in Example 5.3.1, develop an algorithm for the upper-lower decomposition of a symmetric positive definite matrix. 
Then use it to factorize the matrix in Example 5.3.1, and verify your results, using the function [U, D] =udut (R). 
6.9 In this problem we explore the meaning of the various quantities in the decomposition $\mathbf{R} = \mathbf{U}\mathbf{D}\mathbf{U}^H$ of the correlation matrix.
(a) Show that the rows of $\mathbf{A} = \mathbf{U}^{-1}$ are the MMSE estimators of $x_m$ from $x_{m+1}, x_{m+2}, \ldots, x_M$.
(b) Show that the decomposition $\mathbf{R} = \mathbf{U}\mathbf{D}\mathbf{U}^H$ can be obtained by the Gram-Schmidt orthogonalization process, starting with the
random variable $x_M$ and ending with $x_1$, that is, proceeding backward.
6.10 In this problem we clarify the various quantities and the form of the partitionings involved in the $\mathbf{U}\mathbf{D}\mathbf{U}^H$ decomposition, using an
$m = 4$ correlation matrix.
(a) Prove that the components of the forward prediction error vector are uncorrelated.
(b) Writing explicitly the matrix $\mathbf{R}$, identify and express the quantities in Equations (6.3.62) through (6.3.67).
(c) Using the matrix $\mathbf{R}$ in Example 5.3.1, compute the predictors in (6.3.67) by using the corresponding normal equations, and verify
your results by comparing them with the rows of matrix $\mathbf{A}$ computed directly from the $\mathbf{L}\mathbf{D}\mathbf{L}^H$ decomposition of $\mathbf{R}^{-1}$ or the $\mathbf{U}\mathbf{D}\mathbf{U}^H$
decomposition of $\mathbf{R}$ (see Table 6.1).
6.11 Given an all-zero lattice filter with coefficients $k_0$ and $k_1$, determine the MSE $P(k_0, k_1)$ as a function of the required
second-order moments, assumed jointly stationary, and plot the error performance surface. Use the statistics in Example 5.2.1.
6.12 Given the autocorrelation $r(0) = 1$, $r(1) = r(2) = 1/2$, and $r(3) = 1/4$, determine all possible representations for the
third-order prediction error filter (see Figure 6.7).
6.13 Repeat Problem 6.12 for $k_0 = k_1 = k_2 = 1/3$ and $P_3 = (2/3)^3$.
6.14 Use Levinson's algorithm to solve the normal equations $\mathbf{R}\mathbf{c} = \mathbf{d}$, where $\mathbf{R} = \mathrm{Toeplitz}\{3, 2, 1\}$ and $\mathbf{d} = [6\ 6\ 2]^T$.
6.15 Consider a random sequence with autocorrelation $\{r(l)\}_0^3 = \{1, 0.8, 0.6, 0.4\}$. (a) Determine the FLP $\mathbf{a}_m$ and the
corresponding error $P_m^f$ for $m = 1, 2, 3$. (b) Determine and draw the flow diagram of the third-order lattice prediction error
filter.
6.16 Using the Levinson-Durbin algorithm, determine the third-order linear predictor $\mathbf{a}_3$ and the MMSE $P_3$ for a signal with
autocorrelation $r(0) = 1$, $r(1) = r(2) = 1/2$, and $r(3) = 1/4$.
6.17 Given the autocorrelation sequence $r(0) = 1$, $r(1) = r(2) = 1/2$, and $r(3) = 1/4$, compute the lattice and direct-form coefficients
of the prediction error filter, using the algorithm of Schur.
6.18 Determine $\rho_1$ and $\rho_2$ so that the matrix $\mathbf{R} = \mathrm{Toeplitz}\{1, \rho_1, \rho_2\}$ is positive definite.
6.19 Suppose that we want to fit an AR(2) model to a sinusoidal signal with random phase in additive noise. The autocorrelation
sequence is given by

$$r(l) = P_0 \cos \omega_0 l + \sigma_v^2\, \delta(l)$$

(a) Determine the model parameters $a_1^{(2)}$, $a_2^{(2)}$, and $\sigma_w^2$ in terms of $P_0$, $\omega_0$, and $\sigma_v^2$. (b) Determine the lattice parameters of the
model. (c) What are the limiting values of the direct and lattice parameters of the model when $\sigma_v^2 \to 0$?
6.20 Given the parameters $r(0) = 1$, $k_0 = k_1 = 1/2$, and $k_2 = 1/4$, determine all other equivalent representations of the prediction
error filter (see Figure 6.7).
6.21 Let $\{r(l)\}_0^P$ be samples of the autocorrelation sequence of a stationary random signal $x(n)$. (a) Is it possible to extend $r(l)$
for $|l| > P$ so that the PSD

$$R(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r(l)\, e^{-j\omega l}$$

is valid, that is, $R(e^{j\omega}) \ge 0$? (b) Using the algorithm of Levinson-Durbin, develop a procedure to check if a given autocorrelation
extension is valid. (c) Use the algorithm in part (b) to find the necessary and sufficient conditions so that $r(0) = 1$, $r(1) = \rho_1$, and
$r(2) = \rho_2$ are a valid autocorrelation sequence. Is the resulting extension unique?
6.22 Justify the following statements. (a) The whitening filter for a stationary process $x(n)$ is time-varying. (b) The filter in part (a)
can be implemented by using a lattice structure and switching its stages on one by one with the arrival of each new sample. (c) If
$x(n)$ is AR($P$), the whitening filter becomes time-invariant $P + 1$ sampling intervals after the first sample is applied. Note: We
assume that the input is applied to the filter at $n = 0$. If the input is applied at $n = -\infty$, the whitening filter of a stationary process is
always time-invariant.
6.23 Given the parameters $r(0) = 1$, $k_0 = 1/2$, $k_1 = 1/3$, and $k_2 = 1/4$, compute the determinant of the matrix
$\mathbf{R}_4 = \mathrm{Toeplitz}\{r(0), r(1), r(2), r(3)\}$.
6.24 (a) Determine the lattice second-order prediction error filter (PEF) for a sequence $x(n)$ with autocorrelation $r(l) = (1/2)^{|l|}$. (b)
Repeat part (a) for the sequence $y(n) = x(n) + v(n)$, where $v(n) \sim \mathrm{WN}(0, 0.2)$ is uncorrelated with $x(n)$. (c) Explain the change
in the lattice parameters using frequency domain reasoning (think of the PEF as a whitening filter).
6.25 Consider a prediction error filter specified by $P_3 = (15/16)^2$, $k_0 = 1/4$, $k_1 = 1/2$, and $k_2 = 1/4$. (a) Determine the direct-form
filter coefficients. (b) Determine the autocorrelation values $r(1)$, $r(2)$, and $r(3)$. (c) Determine the value $r(4)$ so that the
MMSE $P_4$ for the corresponding fourth-order filter is the minimum possible.
6.26 Consider a prediction error filter $A_M(z) = 1 + a_1^{(M)} z^{-1} + \cdots + a_M^{(M)} z^{-M}$ with lattice parameters $k_1, k_2, \ldots, k_M$. (a) Show that if
we set $\hat{k}_m = (-1)^m k_m$, then $\hat{a}_m^{(M)} = (-1)^m a_m^{(M)}$. (b) What are the new filter coefficients if we set $\hat{k}_m = \rho^m k_m$, where $\rho$ is a
complex number with $|\rho| = 1$? What happens if $|\rho| < 1$?
6.27 Show that the MMSE linear predictor of $x(n + D)$ in terms of $x(n), x(n-1), \ldots, x(n-M+1)$ for $D \ge 1$ is given by

$$\mathbf{R}\,\mathbf{a}^{(D)} = -\mathbf{r}^{(D)}$$

where $\mathbf{r}^{(D)} = [r(D)\ r(D+1)\ \cdots\ r(D+M-1)]^T$. Develop a recursion that computes $\mathbf{a}^{(D+1)}$ from $\mathbf{a}^{(D)}$ by exploring the
shift invariance of the vector $\mathbf{r}^{(D)}$. See Manolakis et al. (1983).
6.28 Consider the estimation of a constant $\alpha$ from its noisy observations. The signal and observation models are given by

$$y(n+1) = y(n) \qquad n > 0 \qquad y(0) = \alpha$$

$$x(n) = y(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$$

(a) Develop scalar Kalman filter equations, assuming the initial condition on the a posteriori error variance $R_{\tilde{y}}(0|0)$ equal to $r_0$.
(b) Show that the a posteriori error variance $R_{\tilde{y}}(n|n)$ is given by

$$R_{\tilde{y}}(n|n) = \frac{r_0}{1 + (r_0/\sigma_v^2)\, n} \tag{P.1}$$

(c) Show that the optimal filter for the estimation of the constant $\alpha$ is given by

$$\hat{y}(n) = \hat{y}(n-1) + \frac{r_0/\sigma_v^2}{1 + (r_0/\sigma_v^2)\, n}\,[x(n) - \hat{y}(n-1)]$$

6.29 Consider a random process with PSD given by

$$R_s(e^{j\omega}) = \frac{4}{2.4661 - 1.629\cos\omega + 0.81\cos 2\omega}$$

(a) Using MATLAB, plot the PSD $R_s(e^{j\omega})$ and determine the resonant frequency $\omega_0$.
(b) Using spectral factorization, develop a signal model for the process of the form

$$\mathbf{y}(n) = \mathbf{A}\,\mathbf{y}(n-1) + \mathbf{B}\,\eta(n)$$

$$s(n) = [1\ \ 0]\,\mathbf{y}(n)$$

where $\mathbf{y}(n)$ is a $2 \times 1$ vector, $\eta(n) \sim \mathrm{WGN}(0, 1)$, and $\mathbf{A}$ and $\mathbf{B}$ are matrices with appropriate dimensions.
(c) Let $x(n)$ be the observed values of $s(n)$ given by

$$x(n) = s(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, 1)$$

Assuming reasonable initial conditions, develop Kalman filter equations and implement them, using MATLAB. Study the
performance of the filter by simulating a few sample functions of the signal process $s(n)$ and its observation $x(n)$.


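6.30 Alternative form of the Kalman filter. A number of different identities and expressions can be obtained for the quantities defining
the Kalman filter.
(a) By manipulating the last two equations in (6.6.39) show that

$$\mathbf{R}_{\tilde{y}}(n|n) = \mathbf{R}_{\tilde{y}}(n|n-1) - \mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n)\,[\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1)\mathbf{H}^H(n) + \mathbf{R}_v(n)]^{-1}\,\mathbf{H}(n)\mathbf{R}_{\tilde{y}}(n|n-1) \tag{P.2}$$

(b) If the inverses of $\mathbf{R}_{\tilde{y}}(n|n)$, $\mathbf{R}_{\tilde{y}}(n|n-1)$, and $\mathbf{R}_v$ exist, then show that

$$\mathbf{R}_{\tilde{y}}^{-1}(n|n) = \mathbf{R}_{\tilde{y}}^{-1}(n|n-1) + \mathbf{H}^H(n)\mathbf{R}_v^{-1}(n)\mathbf{H}(n) \tag{P.3}$$

This shows that the update of the error covariance matrix does not require the Kalman gain matrix (but does require matrix
inverses).
(c) Finally show that the gain matrix is given by

$$\mathbf{K}(n) = \mathbf{R}_{\tilde{y}}(n|n)\,\mathbf{H}^H(n)\,\mathbf{R}_v^{-1}(n) \tag{P.4}$$

which is computed by using the a posteriori error covariance matrix.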
6.31 In Example 6.6.3 we assumed that only the position measurements were available for estimation. In this problem we will assume
that we also have a noisy sensor that measures velocity. Hence the observation model is

$$\mathbf{x}(n) \triangleq \begin{bmatrix} x_p(n) \\ x_v(n) \end{bmatrix} = \begin{bmatrix} y_p(n) + v_1(n) \\ y_v(n) + v_2(n) \end{bmatrix} \tag{P.5}$$

where $v_1(n)$ and $v_2(n)$ are two independent zero-mean white Gaussian noise sources with variances $\sigma_{v_1}^2$ and $\sigma_{v_2}^2$,
respectively.
(a) Using the state vector model given in Example 6.6.3 and the observation model in (P.5), develop Kalman filter equations to
estimate position and velocity of the object at each $n$.
(b) Using the parameter values

$$T = 0.1 \qquad \sigma_{v_1}^2 = \sigma_{v_2}^2 = \sigma_\eta^2 = 0.25 \qquad y_p(-1) = 0 \qquad y_v(-1) = 1$$

simulate the true and observed positions and velocities of the object. Using your Kalman filter equations, generate plots similar to
the ones given in Figures 6.11 and 6.12.
(c) Discuss the effects of velocity measurements on the estimates.
6.32 In this problem, we will assume that the acceleration $y_a(n)$ is an AR(1) process rather than a white noise process. Let $y_a(n)$
be given by

$$y_a(n) = \alpha\, y_a(n-1) + \eta(n) \qquad \eta(n) \sim \mathrm{WGN}(0, \sigma_\eta^2) \qquad y_a(-1) = 0 \tag{P.6}$$

(a) Augment the state vector $\mathbf{y}(n)$ in (6.6.48), using variable $y_a(n)$, and develop the state vector as well as the observation
model, assuming that only the position is measured.
(b) Using the above model and the parameter values

$$T = 0.1 \qquad \alpha = 0.9 \qquad \sigma_v^2 = \sigma_\eta^2 = 0.25$$

$$y_p(-1) = 0 \qquad y_v(-1) = 1 \qquad y_a(-1) = 0$$

simulate the linear motion of the object. Using Kalman filter equations, estimate the position, velocity, and acceleration values of
the object at each $n$. Generate performance plots similar to the ones given in Figures 6.11 and 6.12.
(c) Now assume that noisy measurements of $y_v(n)$ and $y_a(n)$ are also available, that is, the observation model is

$$\mathbf{x}(n) \triangleq \begin{bmatrix} x_p(n) \\ x_v(n) \\ x_a(n) \end{bmatrix} = \begin{bmatrix} y_p(n) + v_1(n) \\ y_v(n) + v_2(n) \\ y_a(n) + v_3(n) \end{bmatrix} \tag{P.7}$$

where $v_1(n)$, $v_2(n)$, and $v_3(n)$ are IID zero-mean white Gaussian noise sources with variance $\sigma_v^2$. Repeat parts (a) and (b)
above.
CHAPTER 7


Least-Squares Filtering and Prediction 


In this chapter, we deal with the design and properties of linear combiners, finite impulse response (FIR) filters, and
linear predictors that are optimum in the least-squares error (LSE) sense. The principle of least squares is widely used
in practice because second-order moments are rarely known. In the first part of this chapter (Sections 7.1 through 7.4),
we concentrate on the design, properties, and applications of least-squares (LS¹) estimators. Section 7.1 discusses the
principle of LS estimation. The unique aspects of the different implementation structures, starting with the general
linear combiner followed by the FIR filter and predictor, are treated in Sections 7.2 to 7.4. In the second part
(Section 7.5), we discuss the algorithms for the solution of the LSE normal equations and the computation of LSE
estimates, including QR decomposition techniques (Householder reflections, Givens rotations, and modified
Gram-Schmidt orthogonalization) and the singular value decomposition (SVD).


7.1 The Principle of Least Squares 


The principle of least squares was introduced by the German mathematician Carl Friedrich Gauss, who used it to
determine the orbit of the asteroid Ceres in 1801 by formulating the estimation problem as an optimization problem.

The design of optimum filters in the minimum mean square error (MMSE) sense, discussed in Chapter 5,
requires the a priori knowledge of second-order moments. However, such statistical information is simply not
available in most practical applications, for which we can only obtain measurements of the input and desired response
signals. To avoid this problem, we can (1) estimate the required second-order moments from the available data (see
Chapter 4), if possible, to obtain an estimate of the optimum MMSE filter, or (2) design an optimum filter by
minimizing a criterion of performance that is a function of the available data.

In this chapter, we use the minimization of the sum of the squares of the estimation error as the criterion of 
performance for the design of optimum filters. This method, known as least-squares error (LSE) estimation, 
requires the measurement of both the input signal and the desired response signal. A natural question arising at this 
point is, What is the purpose of estimating the values of a known, desired response signal? There are several answers: 
1. In system modeling applications, the goal is to obtain a mathematical model describing the input-output behavior
of an actual system. A quality estimator provides a good model for the system. The desired result is the estimator
or system model, not the actual estimate.

2. In linear predictive coding, the useful result is the prediction error or the respective predictor coefficients. 

3. In many applications, the desired response is not available (e.g., digital communications). Therefore, we do not 
always have a complete set of data from which to design the LSE estimator. However, if the data do not change 
significantly over a number of sets, then one special complete set, the training set, is used to design the estimator. 
The resulting estimator is then applied to the processing of the remaining incomplete sets. 

The use of measured signal values to determine the coefficients of the estimator leads to some fundamental 
differences between MMSE and LSE estimation that are discussed where appropriate. 

To summarize, depending on the available information, there are two ways to design an optimum estimator: (1)
If we know the second-order moments, we use the MMSE criterion and design a filter that is optimum for all possible
sets of data with the same statistics. (2) If we only have a block of data, we use the LSE criterion to design an
estimator that is optimum for the given block of data. Optimum MMSE estimators are obtained by using ensemble
averages, whereas LSE estimators are obtained by using finite-length time averages. For example, an MMSE
estimator, designed using ensemble averages, is optimum for all realizations. In contrast, an LSE estimator, designed
using a block of data from a particular realization, depends on the numerical values of the samples used in the design. If
the processes are ergodic, the LSE estimator approaches the MMSE estimator as the block length of the data
increases toward infinity.

¹A note about abbreviations used throughout the chapter: The two acronyms LSE and LS will be used almost interchangeably. Although
LSE is probably the more accurate term, LS has become a standard reference to LSE estimators.


7.2 Linear Least-Squares Error Estimation 


We start with the derivation of general linear LS filters that are implemented using the linear combiner structure
described in Section 5.2. A set of measurements of the desired response $y(n)$ and the input signals $x_k(n)$ for
$1 \le k \le M$ has been taken for $0 \le n \le N-1$. As in optimum MMSE estimation, the problem is to estimate the
desired response $y(n)$ using the linear combination

$$\hat{y}(n) = \sum_{k=1}^{M} c_k^*(n)\, x_k(n) = \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{7.2.1}$$

We define the estimation error as

$$e(n) = y(n) - \hat{y}(n) = y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{7.2.2}$$

and the coefficients of the combiner are determined by minimizing the sum of the squared errors

$$E \triangleq \sum_{n=0}^{N-1} |e(n)|^2 \tag{7.2.3}$$

that is, the energy of the error signal. For this minimization to be possible, the coefficient vector $\mathbf{c}(n)$ should be held
constant over the measurement time interval $0 \le n \le N-1$. The constant vector $\mathbf{c}_{ls}$ resulting from this
optimization depends on the measurement set and is known as the linear LSE estimator. In the statistical literature,
LSE estimation is known as linear regression, where (7.2.2) is called a regression function, $e(n)$ are known as
residuals (leftovers), and $\mathbf{c}(n)$ is the regression vector (Montgomery and Peck 1982).
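The system of equations in (7.2.2), or equivalently $e^*(n) = y^*(n) - \mathbf{x}^H(n)\,\mathbf{c}$, can be written in matrix form as

$$\begin{bmatrix} e^*(0) \\ e^*(1) \\ \vdots \\ e^*(N-1) \end{bmatrix} = \begin{bmatrix} y^*(0) \\ y^*(1) \\ \vdots \\ y^*(N-1) \end{bmatrix} - \begin{bmatrix} x_1^*(0) & x_2^*(0) & \cdots & x_M^*(0) \\ x_1^*(1) & x_2^*(1) & \cdots & x_M^*(1) \\ \vdots & \vdots & & \vdots \\ x_1^*(N-1) & x_2^*(N-1) & \cdots & x_M^*(N-1) \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_M \end{bmatrix} \tag{7.2.4}$$

or more compactly as

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c} \tag{7.2.5}$$

where

$$\begin{aligned}
\mathbf{e} &\triangleq [e(0)\ e(1)\ \cdots\ e(N-1)]^H && \text{error data vector } (N \times 1) \\
\mathbf{y} &\triangleq [y(0)\ y(1)\ \cdots\ y(N-1)]^H && \text{desired response vector } (N \times 1) \\
\mathbf{X} &\triangleq [\mathbf{x}(0)\ \mathbf{x}(1)\ \cdots\ \mathbf{x}(N-1)]^H && \text{input data matrix } (N \times M) \\
\mathbf{c} &\triangleq [c_1\ c_2\ \cdots\ c_M]^T && \text{combiner parameter vector } (M \times 1)
\end{aligned} \tag{7.2.6}$$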


are defined by comparing (7.2.4) to (7.2.5). The input data matrix $\mathbf{X}$ can be partitioned either columnwise or rowwise
as follows:

$$\mathbf{X} \triangleq [\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \cdots\ \tilde{\mathbf{x}}_M] = \begin{bmatrix} \mathbf{x}^H(0) \\ \mathbf{x}^H(1) \\ \vdots \\ \mathbf{x}^H(N-1) \end{bmatrix} \tag{7.2.7}$$

where the columns $\tilde{\mathbf{x}}_k$ of $\mathbf{X}$

$$\tilde{\mathbf{x}}_k \triangleq [x_k(0)\ x_k(1)\ \cdots\ x_k(N-1)]^H$$

will be called data records and the rows

$$\mathbf{x}(n) \triangleq [x_1(n)\ x_2(n)\ \cdots\ x_M(n)]^T$$

will be called snapshots. Both of these partitionings of the data matrix, which are illustrated in Figure 7.1, are useful
in the derivation, interpretation, and computation of LSE estimators.
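
The two partitionings can be made concrete with a small sketch. It assumes real-valued data, so that Hermitian transposes reduce to ordinary transposes, and the helper names (`data_matrix`, `records`, `error_vector`, `energy`), the data values, and the trial coefficient vector are illustrative choices rather than anything prescribed by the text.

```python
# Sketch: assemble the data matrix X from snapshots and evaluate e = y - Xc.
# Real-valued data are assumed, so x^H(n) is simply the row [x_1(n), ..., x_M(n)].

def data_matrix(snapshots):
    """Stack the snapshots x(n) as the rows of X (N x M)."""
    return [list(x) for x in snapshots]

def records(X):
    """Return the columns of X, i.e. one data record per input (sensor)."""
    return [list(col) for col in zip(*X)]

def error_vector(X, y, c):
    """e(n) = y(n) - sum_k c_k x_k(n) for every snapshot n, as in (7.2.5)."""
    return [yn - sum(ck * xk for ck, xk in zip(c, row)) for row, yn in zip(X, y)]

def energy(e):
    """Sum of squared errors (7.2.3), the quantity the LSE criterion minimizes."""
    return sum(en * en for en in e)

# Illustrative data: N = 4 snapshots of M = 2 inputs; c is an arbitrary trial vector.
snapshots = [(1, 2), (2, 1), (1, 2), (1, 3)]
y = [1, 2, 3, 2]
c = [1, 0]

X = data_matrix(snapshots)
e = error_vector(X, y, c)
print(records(X))      # two records, each of length N
print(e, energy(e))    # error vector and its energy for this particular c
```

Any other choice of `c` gives a different error energy; the LSE estimator is the choice that makes `energy(e)` as small as possible.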


FIGURE 7.1
The columns of the data matrix are the records of data collected at each input (sensor), whereas each row contains the samples from all
inputs at the same instant. [Diagram labels: desired response y, input signals X, coefficient vector c; rows are indexed by time samples
(snapshots) and columns by sensor (records).]


The LSE estimator operates in a block processing mode; that is, it processes a frame of $N$ snapshots using the
steps shown in Figure 7.2. The input signals are blocked into frames of $N$ snapshots with successive frames
overlapping by $N_0$ samples. The values of $N$ and $N_0$ depend on the application. The required estimate or residual
signals are unblocked at the final stage of the processor.

FIGURE 7.2
Block processing implementation of a general linear LSE estimator. [Block diagram: inputs x_1(n), ..., x_M(n) and y(n) → frame
blocking (frame length N, overlap N_0) → compute and solve normal equations → compute estimates or residuals → frame unblocking →
ŷ(n) or e(n).]
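
The frame-blocking stage can be sketched in a few lines. The function name `frames` and the decision to drop a trailing incomplete frame are assumptions made for this illustration; the text does not specify how a final partial frame is handled.

```python
def frames(signal, N, N0):
    """Split `signal` into frames of N samples; successive frames overlap by N0.

    Assumption of this sketch: a trailing frame with fewer than N samples is dropped.
    """
    if not 0 <= N0 < N:
        raise ValueError("overlap N0 must satisfy 0 <= N0 < N")
    hop = N - N0                      # each new frame advances by N - N0 samples
    return [signal[s:s + N] for s in range(0, len(signal) - N + 1, hop)]

x = list(range(10))
print(frames(x, N=4, N0=2))           # frames that share two samples with their neighbor
print(frames(x, N=5, N0=0))           # non-overlapping frames
```

Each frame would then be passed to the normal-equation stage, and the per-frame estimates or residuals recombined at the output.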


If we set $\mathbf{e} = \mathbf{0}$, we have a set of $N$ equations with $M$ unknowns. If $N = M$, then (7.2.4) usually has a unique
solution. For $N > M$, we have an overdetermined system of linear equations that typically has no solution.
Conversely, if $N < M$, we have an underdetermined system that has an infinite number of solutions. However,
whether $N < M$ or $N > M$, the system (7.2.4) has a natural, unique, least-squares solution. We next focus our attention
on overdetermined systems since they play a very important role in practical applications.


7.2.1 Derivation of the Normal Equations 


We provide an algebraic and a geometric solution to the LSE estimation problem; a calculus-based derivation is given 
in Problem 7.1. 


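Algebraic derivation. The energy of the error can be written as

$$\begin{aligned}
E = \mathbf{e}^H\mathbf{e} &= (\mathbf{y}^H - \mathbf{c}^H\mathbf{X}^H)(\mathbf{y} - \mathbf{X}\mathbf{c}) \\
&= \mathbf{y}^H\mathbf{y} - \mathbf{c}^H\mathbf{X}^H\mathbf{y} - \mathbf{y}^H\mathbf{X}\mathbf{c} + \mathbf{c}^H\mathbf{X}^H\mathbf{X}\mathbf{c} \\
&= E_y - \mathbf{c}^H\hat{\mathbf{d}} - \hat{\mathbf{d}}^H\mathbf{c} + \mathbf{c}^H\hat{\mathbf{R}}\mathbf{c}
\end{aligned} \tag{7.2.8}$$

where

$$E_y \triangleq \mathbf{y}^H\mathbf{y} = \sum_{n=0}^{N-1} |y(n)|^2 \tag{7.2.9}$$

$$\hat{\mathbf{R}} \triangleq \mathbf{X}^H\mathbf{X} = \sum_{n=0}^{N-1} \mathbf{x}(n)\,\mathbf{x}^H(n) \tag{7.2.10}$$

$$\hat{\mathbf{d}} \triangleq \mathbf{X}^H\mathbf{y} = \sum_{n=0}^{N-1} \mathbf{x}(n)\,y^*(n) \tag{7.2.11}$$

Note that these quantities can be viewed as time-average estimates of the desired response power, correlation matrix
of the input data vector, and the cross-correlation vector between the desired response and the data vector, when these
quantities are divided by the number of data samples $N$.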


We emphasize that all formulas derived for the MMSE criterion hold for the LSE criterion if we replace the
expectation operator $E\{(\cdot)\}$ with the time-average operator $\frac{1}{N}\sum_{n=0}^{N-1}(\cdot)$. This results from the fact that both criteria are
quadratic cost functions. Therefore, working as in Section 5.2.2, we conclude that if the time-average correlation
matrix $\hat{\mathbf{R}}$ is positive definite, the LSE estimator $\mathbf{c}_{ls}$ is provided by the solution of the normal equations

$$\hat{\mathbf{R}}\,\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{7.2.12}$$

and the minimum sum of squared errors is given by

$$E_{ls} = E_y - \hat{\mathbf{d}}^H\hat{\mathbf{R}}^{-1}\hat{\mathbf{d}} = E_y - \hat{\mathbf{d}}^H\mathbf{c}_{ls} \tag{7.2.13}$$

Since $\hat{\mathbf{R}}$ is Hermitian, we only need to compute the elements

$$\hat{r}_{ij} = \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j \tag{7.2.14}$$

in the upper triangular part, which requires $M(M+1)/2$ dot products. The right-hand side requires $M$ dot products

$$\hat{d}_i = \tilde{\mathbf{x}}_i^H\mathbf{y} \tag{7.2.15}$$

Note that each dot product involves $N$ arithmetic operations, each consisting of one multiplication and one addition.
Thus, forming the normal equations requires a total of

$$\tfrac{1}{2}M(M+1)N + MN = \tfrac{1}{2}M^2N + \tfrac{3}{2}MN \tag{7.2.16}$$

arithmetic operations. When $\hat{\mathbf{R}}$ is nonsingular, which is the case when $\hat{\mathbf{R}}$ is positive definite, we can solve the normal
equations using either the $\mathbf{L}\mathbf{D}\mathbf{L}^H$ or the Cholesky decomposition (see Section 5.3). However, it should be stressed at
this point that most of the computational work lies in forming the normal equations rather than in their solution. The
formulation of the overdetermined LS equations and the normal equations is illustrated graphically in Figure 7.3. The
solution of LS problems has been extensively studied in various application areas and in numerical analysis. The
basic methods for the solution of the LS problem, which are discussed in this book, are shown in Figure 7.4. We just
stress here that for overdetermined LS problems, well-behaved data, and sufficient numerical precision, all these
methods provide comparable results.
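
The procedure just described (form only the upper triangle of the time-average correlation matrix, form the cross-correlation vector, then solve by Cholesky factorization) can be sketched as follows. Real-valued data are assumed, the function names and data values are illustrative, and the operation counter is there only to compare against (7.2.16); this is a minimal sketch, not a numerically hardened solver.

```python
import math

def normal_equations(X, y):
    """Form R = X^T X (upper triangle computed, then mirrored) and d = X^T y.

    Returns R, d and the number of multiply-add operations that were used.
    """
    N, M = len(X), len(X[0])
    cols = [[X[n][k] for n in range(N)] for k in range(M)]    # data records
    R = [[0.0] * M for _ in range(M)]
    ops = 0
    for i in range(M):
        for j in range(i, M):                    # M(M+1)/2 dot products
            R[i][j] = R[j][i] = sum(a * b for a, b in zip(cols[i], cols[j]))
            ops += N
    d = []
    for i in range(M):                           # M further dot products
        d.append(sum(a * b for a, b in zip(cols[i], y)))
        ops += N
    return R, d, ops

def cholesky_solve(R, d):
    """Solve R c = d for symmetric positive definite R using R = L L^T."""
    M = len(R)
    L = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = R[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    z = []
    for i in range(M):                           # forward substitution: L z = d
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    c = [0.0] * M
    for i in reversed(range(M)):                 # back substitution: L^T c = z
        c[i] = (z[i] - sum(L[k][i] * c[k] for k in range(i + 1, M))) / L[i][i]
    return c

# Illustrative data with N = 4 samples and M = 2 coefficients.
X = [[1, 2], [2, 1], [1, 2], [1, 3]]
y = [1, 2, 3, 2]
N, M = len(X), len(X[0])

R, d, ops = normal_equations(X, y)
c = cholesky_solve(R, d)
print(R, d, c)
print(ops, M * (M + 1) * N // 2 + M * N)         # measured count versus (7.2.16)
```

For this small case the count is dominated by the dot products of length N, which is the point made in the text: forming the normal equations costs far more than the M x M solve once N is much larger than M.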


FIGURE 7.3
The LS problem and computation of the normal equations. [Diagram: the overdetermined least-squares equations (N data samples,
M coefficients, N > M) alongside the M × M normal equations, with $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ and $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$.]

FIGURE 7.4
Classification of different computational algorithms for the solution of the LS problem.

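Geometric derivation. We may think of the desired response record $\mathbf{y}$ and the data records $\tilde{\mathbf{x}}_k$, $1 \le k \le M$, as
vectors in an $N$-dimensional vector space, with the dot product and length defined by

$$\langle \tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j \rangle \triangleq \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j = \sum_{n=0}^{N-1} x_i(n)\,x_j^*(n) \tag{7.2.17}$$

and

$$\|\tilde{\mathbf{x}}\|^2 \triangleq \langle \tilde{\mathbf{x}}, \tilde{\mathbf{x}} \rangle = \sum_{n=0}^{N-1} |x(n)|^2 = E_x \tag{7.2.18}$$

respectively. The estimate of the desired response record can be expressed as

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{c} = \sum_{k=1}^{M} c_k\,\tilde{\mathbf{x}}_k \tag{7.2.19}$$

that is, as a linear combination of the data records.

The $M$ vectors $\tilde{\mathbf{x}}_k$ form an $M$-dimensional subspace, called the estimation space, which is the column space of
data matrix $\mathbf{X}$. Clearly, any estimate $\hat{\mathbf{y}}$ must lie in the estimation space. The desired response record $\mathbf{y}$, in general,
lies outside the estimation space. The estimation space for $M = 2$ and $N = 3$ is illustrated in Figure 7.5. The
error vector $\mathbf{e}$ points from the tip of $\hat{\mathbf{y}}$ to the tip of $\mathbf{y}$. The squared length of $\mathbf{e}$ is minimum when $\mathbf{e}$ is
perpendicular to the estimation space, that is, $\mathbf{e} \perp \tilde{\mathbf{x}}_k$ for $1 \le k \le M$.

FIGURE 7.5
Vector space interpretation of LSE estimation for N = 3 (dimension of data space) and M = 2 (dimension of estimation subspace).

Therefore, we have the orthogonality principle

$$\langle \tilde{\mathbf{x}}_k, \mathbf{e} \rangle = \tilde{\mathbf{x}}_k^H\mathbf{e} = 0 \qquad 1 \le k \le M \tag{7.2.20}$$

or more compactly

$$\mathbf{X}^H\mathbf{e} = \mathbf{X}^H(\mathbf{y} - \mathbf{X}\mathbf{c}_{ls}) = \mathbf{0}$$

or

$$(\mathbf{X}^H\mathbf{X})\,\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y} \tag{7.2.21}$$

which we recognize as the LSE normal equations from (7.2.12).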
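The LS solution splits the desired response $\mathbf{y}$ into two orthogonal components, namely, $\hat{\mathbf{y}}_{ls}$ and $\mathbf{e}_{ls}$.
Therefore,

$$\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}_{ls}\|^2 + \|\mathbf{e}_{ls}\|^2 \tag{7.2.22}$$

and, using (7.2.18) and (7.2.19), we have

$$E_{ls} = E_y - \mathbf{c}_{ls}^H\mathbf{X}^H\mathbf{X}\mathbf{c}_{ls} = E_y - \mathbf{c}_{ls}^H\mathbf{X}^H\mathbf{y} \tag{7.2.23}$$

which is identical to (7.2.13). The normalized total squared error is

$$\mathcal{E} \triangleq \frac{E_{ls}}{E_y} = 1 - \frac{\|\hat{\mathbf{y}}_{ls}\|^2}{\|\mathbf{y}\|^2} \tag{7.2.24}$$

which is in the range $0 \le \mathcal{E} \le 1$, with the limits of 0 and 1 corresponding to the best case (the estimate reproduces $\mathbf{y}$ exactly)
and the worst case (the estimate captures none of the energy of $\mathbf{y}$), respectively.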
Uniqueness. The solution of the LSE normal equations exists and is unique if the time-average correlation
matrix $\hat{\mathbf{R}}$ is invertible. We shall prove the following:
THEOREM 7.1. The time-average correlation matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ is invertible if and only if the columns $\tilde{\mathbf{x}}_k$
of $\mathbf{X}$ are linearly independent, or equivalently if and only if $\hat{\mathbf{R}}$ is positive definite.
Proof. If the columns of $\mathbf{X}$ are linearly independent, then for every $\mathbf{z} \ne \mathbf{0}$ we have $\mathbf{X}\mathbf{z} \ne \mathbf{0}$. This implies that for
every $\mathbf{z} \ne \mathbf{0}$

$$\mathbf{z}^H(\mathbf{X}^H\mathbf{X})\mathbf{z} = (\mathbf{X}\mathbf{z})^H\mathbf{X}\mathbf{z} = \|\mathbf{X}\mathbf{z}\|^2 > 0 \tag{7.2.25}$$

that is, $\hat{\mathbf{R}}$ is positive definite and hence nonsingular.

If the columns of $\mathbf{X}$ are linearly dependent, then there is a vector $\mathbf{z}_0 \ne \mathbf{0}$ such that $\mathbf{X}\mathbf{z}_0 = \mathbf{0}$. Therefore,
$\mathbf{X}^H\mathbf{X}\mathbf{z}_0 = \mathbf{0}$, which implies that $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ is singular.

For a matrix to have linearly independent columns, the number of rows should be equal to or larger than the
number of columns; that is, we must have at least as many equations as unknowns. To summarize, the overdetermined
($N > M$) LS problem has a unique solution provided by the normal equations in (7.2.12) if the time-average
correlation matrix $\hat{\mathbf{R}}$ is positive definite, or equivalently if the data matrix $\mathbf{X}$ has linearly independent columns.

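In this case, the LS solution can be expressed as

$$\mathbf{c}_{ls} = \mathbf{X}^+\mathbf{y} \tag{7.2.26}$$

where

$$\mathbf{X}^+ \triangleq (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H \tag{7.2.27}$$

is an $M \times N$ matrix known as the pseudo-inverse or the Moore-Penrose generalized inverse of matrix $\mathbf{X}$ (Golub and
Van Loan 1996; Strang 1980).
The LS estimate $\hat{\mathbf{y}}_{ls}$ of $\mathbf{y}$ can be expressed as

$$\hat{\mathbf{y}}_{ls} = \mathbf{P}\mathbf{y} \tag{7.2.28}$$

where

$$\mathbf{P} \triangleq \mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H \tag{7.2.29}$$

is known as the projection matrix because it projects the data vector $\mathbf{y}$ onto the column space of $\mathbf{X}$ to provide the LS
estimate $\hat{\mathbf{y}}_{ls}$ of $\mathbf{y}$. Similarly, the LS error vector $\mathbf{e}_{ls}$ can be expressed as

$$\mathbf{e}_{ls} = (\mathbf{I} - \mathbf{P})\,\mathbf{y} \tag{7.2.30}$$

where $\mathbf{I}$ is the $N \times N$ identity matrix. The projection matrix $\mathbf{P}$ is Hermitian and idempotent, that is,

$$\mathbf{P} = \mathbf{P}^H \tag{7.2.31}$$

and

$$\mathbf{P} = \mathbf{P}^2 = \mathbf{P}^H\mathbf{P} \tag{7.2.32}$$

respectively.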

When the columns of X are linearly dependent, the LS problem has many solutions. Since all these solutions 
satisfy the normal equations and the orthogonal projection of y onto the column space of X is unique, all these 
solutions produce an error vector e of equal length, that is, the same LSE. This subject is discussed in Section 7.6.2 
(minimum-norm solution). 
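
The non-uniqueness in the rank-deficient case can be illustrated with exact rational arithmetic. In this sketch the data matrix is an assumed example whose second column is twice the first, and the two candidate coefficient vectors were chosen by hand; both satisfy the normal equations and both give the same sum of squared errors, while the time-average correlation matrix is singular, in line with Theorem 7.1.

```python
from fractions import Fraction as F

# Assumed example: the second data record is exactly twice the first one.
X = [[1, 2], [2, 4], [1, 2], [1, 2]]
y = [1, 2, 3, 2]
M = 2

def gram(X):
    """Time-average correlation matrix R = X^T X (real data)."""
    return [[sum(r[i] * r[j] for r in X) for j in range(M)] for i in range(M)]

def cross(X, y):
    """Cross-correlation vector d = X^T y (real data)."""
    return [sum(r[i] * yn for r, yn in zip(X, y)) for i in range(M)]

def lse(X, y, c):
    """Sum of squared errors for a given coefficient vector c."""
    return sum((yn - sum(ck * xk for ck, xk in zip(c, r))) ** 2 for r, yn in zip(X, y))

R, d = gram(X), cross(X, y)
det_R = R[0][0] * R[1][1] - R[0][1] * R[1][0]

def satisfies_normal_equations(c):
    return all(sum(R[i][j] * c[j] for j in range(M)) == d[i] for i in range(M))

c_a = [F(10, 7), F(0)]     # puts all the weight on the first record
c_b = [F(0), F(5, 7)]      # puts all the weight on the second record

print(det_R)                                                  # singular R
print(satisfies_normal_equations(c_a), satisfies_normal_equations(c_b))
print(lse(X, y, c_a), lse(X, y, c_b))                         # identical LSE
```

Both vectors produce the same estimate of the desired response, namely the orthogonal projection of y onto the one-dimensional column space of X.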


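EXAMPLE 7.2.1 Suppose that we wish to estimate the sequence $\mathbf{y} = [1\ 2\ 3\ 2]^T$ from the observation vectors
$\tilde{\mathbf{x}}_1 = [1\ 2\ 1\ 1]^T$ and $\tilde{\mathbf{x}}_2 = [2\ 1\ 2\ 3]^T$. Determine the optimum filter, the error vector $\mathbf{e}_{ls}$, and the LSE $E_{ls}$.
Solution. We first compute the quantities

$$\hat{\mathbf{R}} = \mathbf{X}^T\mathbf{X} = \begin{bmatrix} 1 & 2 & 1 & 1 \\ 2 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 7 & 9 \\ 9 & 18 \end{bmatrix} \qquad \hat{\mathbf{d}} = \mathbf{X}^T\mathbf{y} = \begin{bmatrix} 1 & 2 & 1 & 1 \\ 2 & 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 10 \\ 16 \end{bmatrix}$$

which give the LS estimator

$$\mathbf{c}_{ls} = \hat{\mathbf{R}}^{-1}\hat{\mathbf{d}} = \begin{bmatrix} \frac{2}{5} & -\frac{1}{5} \\ -\frac{1}{5} & \frac{7}{45} \end{bmatrix} \begin{bmatrix} 10 \\ 16 \end{bmatrix} = \begin{bmatrix} \frac{4}{5} \\ \frac{22}{45} \end{bmatrix}$$

and the LSE

$$E_{ls} = E_y - \hat{\mathbf{d}}^T\mathbf{c}_{ls} = 18 - [10\ \ 16] \begin{bmatrix} \frac{4}{5} \\ \frac{22}{45} \end{bmatrix} = \frac{98}{45}$$

The projection matrix is

$$\mathbf{P} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \begin{bmatrix} \frac{2}{9} & \frac{1}{9} & \frac{2}{9} & \frac{1}{3} \\ \frac{1}{9} & \frac{43}{45} & \frac{1}{9} & -\frac{2}{15} \\ \frac{2}{9} & \frac{1}{9} & \frac{2}{9} & \frac{1}{3} \\ \frac{1}{3} & -\frac{2}{15} & \frac{1}{3} & \frac{3}{5} \end{bmatrix}$$

which can be used to determine the error vector

$$\mathbf{e}_{ls} = \mathbf{y} - \mathbf{P}\mathbf{y} = \left[-\tfrac{7}{9}\ \ -\tfrac{4}{45}\ \ \tfrac{11}{9}\ \ -\tfrac{4}{15}\right]^T$$

whose squared norm is equal to $\|\mathbf{e}_{ls}\|^2 = \frac{98}{45} = E_{ls}$, as expected. We can also easily verify the orthogonality
principle $\mathbf{e}_{ls}^T\tilde{\mathbf{x}}_1 = \mathbf{e}_{ls}^T\tilde{\mathbf{x}}_2 = 0$.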
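
The numbers in Example 7.2.1 can be checked with exact rational arithmetic. The sketch below uses Python's `fractions` module and an explicit 2 x 2 inverse; the variable names are illustrative and the explicit inverse is used only because the example is tiny, not because it is a recommended way to solve normal equations.

```python
from fractions import Fraction as F

X = [[1, 2], [2, 1], [1, 2], [1, 3]]     # rows are snapshots, columns are records
y = [1, 2, 3, 2]
N, M = 4, 2

# Time-average correlation matrix and cross-correlation vector (real data).
R = [[sum(F(r[i] * r[j]) for r in X) for j in range(M)] for i in range(M)]
d = [sum(F(r[i] * yn) for r, yn in zip(X, y)) for i in range(M)]

# Explicit inverse of the 2 x 2 matrix R, then c_ls = R^{-1} d.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
R_inv = [[R[1][1] / det, -R[0][1] / det], [-R[1][0] / det, R[0][0] / det]]
c_ls = [sum(R_inv[i][j] * d[j] for j in range(M)) for i in range(M)]

# Minimum sum of squared errors, E_ls = E_y - d^T c_ls.
E_y = sum(F(yn * yn) for yn in y)
E_ls = E_y - sum(di * ci for di, ci in zip(d, c_ls))

# Projection matrix P = X R^{-1} X^T and error vector e_ls = y - P y.
XRi = [[sum(X[n][k] * R_inv[k][j] for k in range(M)) for j in range(M)] for n in range(N)]
P = [[sum(XRi[n][k] * X[m][k] for k in range(M)) for m in range(N)] for n in range(N)]
e_ls = [y[n] - sum(P[n][m] * y[m] for m in range(N)) for n in range(N)]

# Orthogonality of the error to each data record, and idempotence of P.
ortho = [sum(e_ls[n] * X[n][k] for n in range(N)) for k in range(M)]
P2 = [[sum(P[n][k] * P[k][m] for k in range(N)) for m in range(N)] for n in range(N)]

print(c_ls, E_ls)
print(e_ls, sum(en * en for en in e_ls))
print(ortho, P2 == P)
```

Because every quantity is a `Fraction`, the comparisons are exact and can be matched term by term against the matrices printed in the example.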
Weighted least-squares estimation. The previous results were derived by using an LS criterion that treats every
error $e(n)$ equally. However, based on a priori information, we may wish to place greater importance on some
errors than on others, using the weighted LS criterion

$$E_w = \sum_{n=0}^{N-1} w(n)\,|e(n)|^2 = \mathbf{e}^H\mathbf{W}\mathbf{e} \tag{7.2.33}$$

where

$$\mathbf{W} \triangleq \mathrm{diag}\{w(0), w(1), \ldots, w(N-1)\} \tag{7.2.34}$$

is a diagonal weighting matrix with positive elements. Usually, we choose small weights where the errors are
expected to be large, and vice versa. Minimization of $E_w$ with respect to $\mathbf{c}$ yields the weighted LS (WLS)
estimator

$$\mathbf{c}_{wls} = (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}\mathbf{y} \tag{7.2.35}$$

assuming that the inverse of the matrix $\mathbf{X}^H\mathbf{W}\mathbf{X}$ exists. We can easily see that when $\mathbf{W} = \mathbf{I}$, then $\mathbf{c}_{wls} = \mathbf{c}_{ls}$. The
criterion in (7.2.33) can be generalized by choosing $\mathbf{W}$ to be any Hermitian, positive definite matrix (see Problem
7.2).
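
A minimal sketch of the WLS estimator in (7.2.35) for real data and two coefficients is given below. The function name `wls`, the data (reused from Example 7.2.1), and the particular weights are assumptions chosen for illustration; exact fractions are used so the effect of the weights can be read off directly.

```python
from fractions import Fraction as F

def wls(X, y, w):
    """Weighted LS for real data and M = 2: c = (X^T W X)^{-1} X^T W y, W = diag(w)."""
    A = [[sum(F(wn) * r[i] * r[j] for r, wn in zip(X, w)) for j in range(2)]
         for i in range(2)]
    b = [sum(F(wn) * r[i] * yn for r, yn, wn in zip(X, y, w)) for i in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def residuals(X, y, c):
    return [yn - (c[0] * r[0] + c[1] * r[1]) for r, yn in zip(X, y)]

X = [[1, 2], [2, 1], [1, 2], [1, 3]]
y = [1, 2, 3, 2]

c_ls = wls(X, y, [1, 1, 1, 1])       # W = I reduces to ordinary LS
w = [1, 1, 1, 10]                    # assumed weights: trust the last sample more
c_w = wls(X, y, w)
e_ls, e_w = residuals(X, y, c_ls), residuals(X, y, c_w)

# Weighted orthogonality condition X^T W e = 0 satisfied by the WLS residuals.
weighted_ortho = [sum(wn * r[k] * en for r, wn, en in zip(X, w, e_w)) for k in range(2)]

print(c_ls, c_w)
print(e_ls[3], e_w[3])               # the heavily weighted residual shrinks in magnitude
print(weighted_ortho)
```

Raising the weight on one sample pulls the fit toward that sample at the expense of the others, which is why small weights are assigned where large errors are expected.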


7.2.2 Statistical Properties of Least-Squares Estimators 


A useful approach for evaluating the quality of an LS estimator is to study its statistical properties. Toward this end,
we assume that the obtained measurements $\mathbf{y}$ actually have been generated by

$$\mathbf{y} = \mathbf{X}\mathbf{c}_o + \mathbf{e}_o \tag{7.2.36}$$

where $\mathbf{e}_o$ is the random measurement error vector. We may think of $\mathbf{c}_o$ as the "true" parameter vector. Using
(7.2.36), we see that (7.2.21) gives

$$\mathbf{c}_{ls} = \mathbf{c}_o + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{e}_o \tag{7.2.37}$$

We make the following assumptions about the random measurement error vector $\mathbf{e}_o$:
1. The error vector $\mathbf{e}_o$ has zero mean

$$E\{\mathbf{e}_o\} = \mathbf{0} \tag{7.2.38}$$

2. The error vector $\mathbf{e}_o$ has uncorrelated components with constant variance $\sigma_{e_o}^2$; that is, the correlation matrix is
given by

$$\mathbf{R}_{e_o} = E\{\mathbf{e}_o\mathbf{e}_o^H\} = \sigma_{e_o}^2\mathbf{I} \tag{7.2.39}$$

3. There is no information about $\mathbf{e}_o$ contained in data matrix $\mathbf{X}$; that is,

$$E\{\mathbf{e}_o \mid \mathbf{X}\} = E\{\mathbf{e}_o\} = \mathbf{0} \tag{7.2.40}$$

4. If $\mathbf{X}$ is a deterministic $N \times M$ matrix, then it has rank $M$. This means that $\mathbf{X}$ has full column rank and that
$\mathbf{X}^H\mathbf{X}$ is invertible. If $\mathbf{X}$ is a stochastic $N \times M$ matrix, then $E\{(\mathbf{X}^H\mathbf{X})^{-1}\}$ exists.
In the following analysis, we consider two possibilities: $\mathbf{X}$ is deterministic or $\mathbf{X}$ is stochastic. Under these conditions,
the LS estimator $\mathbf{c}_{ls}$ has several desirable properties.


Deterministic data matrix 


In this case, we assume that the LS estimators are obtained from the deterministic data values; that is, the matrix 
X is treated as a matrix of constants. Then the properties of the LS estimators can be derived from the statistical 
properties of the random measurement error vector e, . 


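PROPERTY 7.2.1 The LS estimator $\mathbf{c}_{ls}$ is an unbiased estimator of $\mathbf{c}_o$, that is,

$$E\{\mathbf{c}_{ls}\} = \mathbf{c}_o$$

Proof. Taking the expectation of both sides of (7.2.37), we have

$$E\{\mathbf{c}_{ls}\} = E\{\mathbf{c}_o\} + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o\} = \mathbf{c}_o \tag{7.2.41}$$

because $\mathbf{X}$ is deterministic and $E\{\mathbf{e}_o\} = \mathbf{0}$.

PROPERTY 7.2.2 The covariance matrix of $\mathbf{c}_{ls}$ corresponding to the error $\mathbf{c}_{ls} - \mathbf{c}_o$ is

$$\boldsymbol{\Gamma}_{ls} \triangleq E\{(\mathbf{c}_{ls} - \mathbf{c}_o)(\mathbf{c}_{ls} - \mathbf{c}_o)^H\} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1} = \sigma_{e_o}^2\hat{\mathbf{R}}^{-1} \tag{7.2.42}$$

Proof. Using (7.2.37), (7.2.39), and the definition (7.2.42), we easily obtain

$$\boldsymbol{\Gamma}_{ls} = (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H E\{\mathbf{e}_o\mathbf{e}_o^H\}\mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1}$$

Note that the diagonal elements of matrix $\sigma_{e_o}^2\hat{\mathbf{R}}^{-1}$ are the variances of the elements of the LS combiner vector $\mathbf{c}_{ls}$.

PROPERTY 7.2.3 An unbiased estimate of the error variance $\sigma_{e_o}^2$ is given by

$$\hat{\sigma}_{e_o}^2 = \frac{E_{ls}}{N - M} \tag{7.2.43}$$

where $N$ is the number of observations, $M$ is the number of parameters, and $E_{ls}$ is the LS error.
Proof. Using (7.2.30) and (7.2.36), and noting that $(\mathbf{I} - \mathbf{P})\mathbf{X} = \mathbf{0}$, we obtain

$$\mathbf{e}_{ls} = (\mathbf{I} - \mathbf{P})\,\mathbf{y} = (\mathbf{I} - \mathbf{P})\,\mathbf{e}_o$$

which results in

$$E_{ls} = \mathbf{e}_{ls}^H\mathbf{e}_{ls} = \mathbf{e}_o^H(\mathbf{I} - \mathbf{P})^H(\mathbf{I} - \mathbf{P})\,\mathbf{e}_o = \mathbf{e}_o^H(\mathbf{I} - \mathbf{P})\,\mathbf{e}_o$$

because of (7.2.32). Since $E_{ls}$ depends on $\mathbf{e}_o$, it is a random variable whose expected value is

$$E\{E_{ls}\} = E\{\mathbf{e}_o^H(\mathbf{I} - \mathbf{P})\mathbf{e}_o\} = E\{\mathrm{tr}[(\mathbf{I} - \mathbf{P})\mathbf{e}_o\mathbf{e}_o^H]\} = \mathrm{tr}[(\mathbf{I} - \mathbf{P})E\{\mathbf{e}_o\mathbf{e}_o^H\}] = \sigma_{e_o}^2\,\mathrm{tr}(\mathbf{I} - \mathbf{P})$$

since $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$, where tr is the trace function. However,

$$\begin{aligned}
\mathrm{tr}(\mathbf{I} - \mathbf{P}) &= \mathrm{tr}[\mathbf{I}_{N \times N} - \mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H] \\
&= \mathrm{tr}(\mathbf{I}_{N \times N}) - \mathrm{tr}[\mathbf{X}(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H] \\
&= \mathrm{tr}(\mathbf{I}_{N \times N}) - \mathrm{tr}[(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^H\mathbf{X}] \\
&= \mathrm{tr}(\mathbf{I}_{N \times N}) - \mathrm{tr}(\mathbf{I}_{M \times M}) = N - M
\end{aligned}$$

therefore

$$\sigma_{e_o}^2 = \frac{E\{E_{ls}\}}{N - M} \tag{7.2.44}$$

which proves that $\hat{\sigma}_{e_o}^2$ is an unbiased estimate of $\sigma_{e_o}^2$.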
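
Properties 7.2.1 and 7.2.3 can be checked empirically with a small Monte Carlo experiment. Everything specific in this sketch is an assumption made for illustration: the fixed data matrix, the "true" parameter vector, the noise level, the number of trials, and the seed. The averages should land close to the true parameters and to the true error variance, but they are random quantities, so only approximate agreement should be expected.

```python
import random

X = [(1, 2), (2, 1), (1, 2), (1, 3)]   # fixed (deterministic) N x M data matrix
C_O = (1.0, -0.5)                      # "true" parameter vector, chosen arbitrarily
SIGMA = 0.5                            # standard deviation of the measurement error

def ls_fit(X, y):
    """Solve the 2 x 2 normal equations and return (c_ls, E_ls)."""
    r11 = sum(a * a for a, b in X)
    r12 = sum(a * b for a, b in X)
    r22 = sum(b * b for a, b in X)
    d1 = sum(a * yn for (a, b), yn in zip(X, y))
    d2 = sum(b * yn for (a, b), yn in zip(X, y))
    det = r11 * r22 - r12 * r12
    c = ((r22 * d1 - r12 * d2) / det, (r11 * d2 - r12 * d1) / det)
    E = sum((yn - c[0] * a - c[1] * b) ** 2 for (a, b), yn in zip(X, y))
    return c, E

def monte_carlo(trials=20000, seed=1):
    """Average c_ls and E_ls / (N - M) over independent noise realizations."""
    rng = random.Random(seed)
    N, M = len(X), 2
    mean_c = [0.0, 0.0]
    mean_var = 0.0
    for _ in range(trials):
        y = [C_O[0] * a + C_O[1] * b + rng.gauss(0.0, SIGMA) for a, b in X]
        c, E = ls_fit(X, y)
        mean_c[0] += c[0] / trials
        mean_c[1] += c[1] / trials
        mean_var += E / (N - M) / trials
    return mean_c, mean_var

mean_c, mean_var = monte_carlo()
print(mean_c)      # should be close to C_O (Property 7.2.1)
print(mean_var)    # should be close to SIGMA ** 2 (Property 7.2.3)
```

Dividing by N rather than N - M in the variance estimate would bias the average downward by the factor (N - M)/N, which is substantial here because N = 4 and M = 2.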


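Similar to (7.2.41), the mean value of $\mathbf{c}_{wls}$ is

$$E\{\mathbf{c}_{wls}\} = E\{\mathbf{c}_o\} + (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}E\{\mathbf{e}_o\} = \mathbf{c}_o \tag{7.2.45}$$

that is, the WLS estimator is an unbiased estimate of $\mathbf{c}_o$. The covariance matrix of $\mathbf{c}_{wls}$ is

$$\boldsymbol{\Gamma}_{wls} = (\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{W}\mathbf{R}_{e_o}\mathbf{W}\mathbf{X}(\mathbf{X}^H\mathbf{W}\mathbf{X})^{-1} \tag{7.2.46}$$

where $\mathbf{R}_{e_o}$ is the correlation matrix of $\mathbf{e}_o$. It is easy to see that when $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$ and $\mathbf{W} = \mathbf{I}$, we obtain (7.2.42).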


PROPERTY 7.2.4 The trace of $\boldsymbol{\Gamma}_{wls}$ attains its minimum when $\mathbf{W} = \mathbf{R}_{e_o}^{-1}$. The resulting estimator

$$\mathbf{c}_{mv} = (\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{X})^{-1}\mathbf{X}^H\mathbf{R}_{e_o}^{-1}\mathbf{y} \tag{7.2.47}$$

is known as the minimum variance or Markov estimator and is the best linear unbiased estimator (BLUE).

Proof. The proof is somewhat involved. Interested readers can see Goodwin and Payne (1977) and Scharf
(1991).
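
Although the proof is omitted, the claim of Property 7.2.4 can be spot-checked numerically from (7.2.46). In this sketch the error correlation matrix is assumed diagonal with unequal variances (an illustrative choice that keeps the inverse trivial), the data matrix is reused from Example 7.2.1, and the helper names are hypothetical. The trace of the covariance is evaluated exactly for a few weightings, including W = I and the Markov choice.

```python
from fractions import Fraction as F

X = [[1, 2], [2, 1], [1, 2], [1, 3]]
r_e = [F(1), F(4), F(1), F(9)]       # assumed diagonal R_eo: unequal error variances

def inv2(A):
    """Inverse of a 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def mul2(A, B):
    """Product of two 2 x 2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def weighted_gram(X, w):
    """X^T diag(w) X for a two-column real X."""
    return [[sum(wn * r[i] * r[j] for r, wn in zip(X, w)) for j in range(2)]
            for i in range(2)]

def cov_wls(X, w, r_e):
    """Covariance (7.2.46) for diagonal W = diag(w) and diagonal R_eo = diag(r_e)."""
    A_inv = inv2(weighted_gram(X, w))
    # For diagonal matrices, W R_eo W = diag(w^2 r_e).
    middle = weighted_gram(X, [wn * wn * rn for wn, rn in zip(w, r_e)])
    return mul2(mul2(A_inv, middle), A_inv)

def trace(A):
    return A[0][0] + A[1][1]

w_markov = [1 / rn for rn in r_e]
gamma_ls = cov_wls(X, [F(1)] * 4, r_e)            # ordinary LS, W = I
gamma_mv = cov_wls(X, w_markov, r_e)              # Markov estimator, W = R_eo^{-1}
gamma_other = cov_wls(X, [F(1), F(2), F(3), F(4)], r_e)   # an arbitrary weighting

print(trace(gamma_ls), trace(gamma_mv), trace(gamma_other))
```

With the Markov weights the sandwich in (7.2.46) collapses to the inverse of X^T R_eo^{-1} X, which the assertions below confirm exactly for this example.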


PROPERTY 7.2.5 If $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$, the LS estimator $\mathbf{c}_{ls}$ is also the best linear unbiased estimator.

Proof. It follows from (7.2.47) with the substitution $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$.

PROPERTY 7.2.6 When the random measurement error vector $\mathbf{e}_o$ has a normal distribution with mean zero and
correlation matrix $\mathbf{R}_{e_o} = \sigma_{e_o}^2\mathbf{I}$, that is, when its components are uncorrelated, the LS estimator $\mathbf{c}_{ls}$ is also
the maximum likelihood estimator.


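Proof. Since the components of vector $\mathbf{e}_o$ are uncorrelated and normally distributed with zero mean and variance $\sigma_{e_o}^2$, the
likelihood function for real-valued $\mathbf{e}_o$ is given by

$$L(\mathbf{c}) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi}\,\sigma_{e_o}} \exp\left[-\frac{|e(n)|^2}{2\sigma_{e_o}^2}\right] = \frac{1}{(2\pi)^{N/2}\sigma_{e_o}^N} \exp\left(-\frac{\mathbf{e}^H\mathbf{e}}{2\sigma_{e_o}^2}\right) \tag{7.2.48}$$

and its logarithm by

$$\ln L(\mathbf{c}) = -\frac{1}{2\sigma_{e_o}^2}\,\mathbf{e}^H\mathbf{e} - \frac{N}{2}\ln(2\pi\sigma_{e_o}^2) = -\frac{1}{2\sigma_{e_o}^2}(\mathbf{y} - \mathbf{X}\mathbf{c})^H(\mathbf{y} - \mathbf{X}\mathbf{c}) + \text{const} \tag{7.2.49}$$

For complex-valued $\mathbf{e}_o$, the terms $\sqrt{2\pi}\,\sigma_{e_o}$ and $2\sigma_{e_o}^2$ in (7.2.48) are replaced by $\pi\sigma_{e_o}^2$ and $\sigma_{e_o}^2$, respectively. Since the
logarithm is a monotonic function, maximization of $L(\mathbf{c})$ is equivalent to maximization of $\ln L(\mathbf{c})$, that is, to minimization of
$(\mathbf{y} - \mathbf{X}\mathbf{c})^H(\mathbf{y} - \mathbf{X}\mathbf{c})$. It is easy to see, by
comparison with (7.2.8), that the LS solution maximizes this likelihood function.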
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Stochastic data matrix 


We now extend the statistical properties of $\mathbf{c}_{ls}$ from the preceding section to the situation in which the data values in $\mathbf{X}$ are obtained from a random source with a known probability distribution. This situation is best handled by first obtaining the desired results conditioned on $\mathbf{X}$, which is equivalent to the deterministic case. We then determine the unconditional results by (statistical) averaging over the conditional distributions using the following properties of the conditional averages.

The conditional mean and the conditional covariance of a random vector $\mathbf{x}(\zeta)$, given another random vector $\mathbf{y}(\zeta)$, are defined by

$$\boldsymbol{\mu}_{x|y} \triangleq E\{\mathbf{x}(\zeta)\,|\,\mathbf{y}(\zeta)\}$$

and

$$\boldsymbol{\Gamma}_{x|y} \triangleq E\{[\mathbf{x}(\zeta)-\boldsymbol{\mu}_{x|y}][\mathbf{x}(\zeta)-\boldsymbol{\mu}_{x|y}]^H\,|\,\mathbf{y}(\zeta)\}$$

respectively. Since both quantities are random objects, it can be shown that

$$\boldsymbol{\mu}_x = E\{\mathbf{x}(\zeta)\} = E_y\{E\{\mathbf{x}(\zeta)\,|\,\mathbf{y}(\zeta)\}\}$$

which is known as the law of iterated expectations, and that

$$\boldsymbol{\Gamma}_x = \boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}} + E_y\{\boldsymbol{\Gamma}_{x|y}\}$$

which is called the decomposition of the covariance rule. This rule states that the covariance of a random vector $\mathbf{x}(\zeta)$ decomposes into the covariance of the conditional mean plus the mean of the conditional covariance. The covariance of the conditional mean, $\boldsymbol{\mu}_{x|y}$, is given by

$$\boldsymbol{\Gamma}^{(y)}_{\mu_{x|y}} = E_y\{[\boldsymbol{\mu}_{x|y}-\boldsymbol{\mu}_x][\boldsymbol{\mu}_{x|y}-\boldsymbol{\mu}_x]^H\}$$

where the notation $\boldsymbol{\Gamma}^{(y)}$ indicates the covariance over the distribution of $\mathbf{y}(\zeta)$. More details can be found in Greene (1993).


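PROPERTY 7.2.7. The LS estimator $\mathbf{c}_{ls}$ is an unbiased estimator of $\mathbf{c}_o$.

Proof. Taking the conditional expectation with respect to $\mathbf{X}$ of both sides of (7.2.37), we obtain

$$E\{\mathbf{c}_{ls}\,|\,\mathbf{X}\} = E\{\mathbf{c}_o\,|\,\mathbf{X}\} + (\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^HE\{\mathbf{e}_o\,|\,\mathbf{X}\} \tag{7.2.50}$$

Now using the law of iterated expectations, we get

$$E\{\mathbf{c}_{ls}\} = E_X\{E\{\mathbf{c}_{ls}\,|\,\mathbf{X}\}\} = \mathbf{c}_o + E\{(\mathbf{X}^H\mathbf{X})^{-1}\mathbf{X}^HE\{\mathbf{e}_o\,|\,\mathbf{X}\}\}$$

Since $E\{\mathbf{e}_o\,|\,\mathbf{X}\} = \mathbf{0}$, from assumption 3, we have $E\{\mathbf{c}_{ls}\} = \mathbf{c}_o$. Thus $\mathbf{c}_{ls}$ is also unconditionally unbiased.


PROPERTY 7.2.8 The covariance matrix of $\mathbf{c}_{ls}$ corresponding to the error $\mathbf{c}_{ls}-\mathbf{c}_o$ is

$$\boldsymbol{\Gamma}_{ls} \triangleq E\{(\mathbf{c}_{ls}-\mathbf{c}_o)(\mathbf{c}_{ls}-\mathbf{c}_o)^H\} = \sigma_{e_o}^2E\{(\mathbf{X}^H\mathbf{X})^{-1}\} \tag{7.2.51}$$

Proof. From (7.2.42), the conditional covariance matrix of $\mathbf{c}_{ls}$, conditional on $\mathbf{X}$, is

$$E\{(\mathbf{c}_{ls}-\mathbf{c}_o)(\mathbf{c}_{ls}-\mathbf{c}_o)^H\,|\,\mathbf{X}\} = \sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1} \tag{7.2.52}$$

For the unconditional covariance, we use the decomposition of covariance rule to obtain

$$\begin{aligned}
E\{(\mathbf{c}_{ls}-\mathbf{c}_o)(\mathbf{c}_{ls}-\mathbf{c}_o)^H\} &= E_X\{E\{(\mathbf{c}_{ls}-\mathbf{c}_o)(\mathbf{c}_{ls}-\mathbf{c}_o)^H\,|\,\mathbf{X}\}\} \\
&\quad + E_X\{(E\{\mathbf{c}_{ls}\,|\,\mathbf{X}\}-\mathbf{c}_o)(E\{\mathbf{c}_{ls}\,|\,\mathbf{X}\}-\mathbf{c}_o)^H\}
\end{aligned}$$

The second term on the right-hand side above is equal to zero since $E\{\mathbf{c}_{ls}\,|\,\mathbf{X}\} = \mathbf{c}_o$, and hence

$$\begin{aligned}
E\{(\mathbf{c}_{ls}-\mathbf{c}_o)(\mathbf{c}_{ls}-\mathbf{c}_o)^H\} &= E_X\{E\{(\mathbf{c}_{ls}-\mathbf{c}_o)(\mathbf{c}_{ls}-\mathbf{c}_o)^H\,|\,\mathbf{X}\}\} \\
&= E_X\{\sigma_{e_o}^2(\mathbf{X}^H\mathbf{X})^{-1}\} = \sigma_{e_o}^2E\{(\mathbf{X}^H\mathbf{X})^{-1}\}
\end{aligned}$$

Thus the earlier result in (7.2.42) is modified by the expected value (or averaging) of $(\mathbf{X}^H\mathbf{X})^{-1}$.

One important conclusion about the statistical properties of the LS estimator is that the results obtained for the deterministic data matrix $\mathbf{X}$ are also valid for the stochastic case. This conclusion also applies for the Markov estimators and maximum likelihood estimators (Greene 1993).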


7.3 Least-Squares FIR Filters 


We will now apply the theory of linear LS error estimation to the design of FIR filters. The treatment closely follows the notation and approach in Section 5.4. Recall that the filtering error is

$$e(n) = y(n) - \sum_{k=0}^{M-1}h(k)\,x(n-k) \triangleq y(n) - \mathbf{c}^H\mathbf{x}(n) \tag{7.3.1}$$

where $y(n)$ is the desired response,

$$\mathbf{x}(n) = [x(n)\;\; x(n-1)\;\cdots\; x(n-M+1)]^T \tag{7.3.2}$$

is the input data vector, and

$$\mathbf{c} = [c_0\;\; c_1\;\cdots\; c_{M-1}]^T \tag{7.3.3}$$

is the filter coefficient vector related to the impulse response by $c_k = h^*(k)$. Suppose that we take measurements of the desired response $y(n)$ and the input signal $x(n)$ over the time interval $0 \le n \le N-1$. We hold the coefficients $\{c_k\}_0^{M-1}$ of the filter constant within this period and set any other required data samples equal to zero. For example, at time $n = 0$, that is, when we take the first measurement $x(0)$, the filter needs the samples $x(0), x(-1), \ldots, x(-M+1)$ to compute the output sample $\hat{y}(0)$. Since the samples $x(-1), \ldots, x(-M+1)$ are not available, to operate the filter, we should replace them with arbitrary values or start the filtering operation at time $n = M-1$. Indeed, for $M-1 \le n \le N-1$, all the input samples required by the filter to compute the output $\{\hat{y}(n)\}_{M-1}^{N-1}$ are available. If we want to compute the output while the last sample $x(N-1)$ is still in the filter memory, we must continue the filtering operation until $n = N+M-2$. Again, we need to assign arbitrary values to the unavailable samples $x(N), \ldots, x(N+M-2)$. Most often, we set the unavailable samples equal to zero, which can be thought of as windowing the sequences $x(n)$ and $y(n)$ with a rectangular window. To simplify the illustration, suppose that $N = 7$ and $M = 3$. Writing (7.3.1) for $n = 0, 1, \ldots, N+M-2$ and arranging in matrix form, we obtain

$$\begin{bmatrix} e^*(0)\\ e^*(1)\\ e^*(2)\\ e^*(3)\\ e^*(4)\\ e^*(5)\\ e^*(6)\\ e^*(7)\\ e^*(8)\end{bmatrix}
=
\begin{bmatrix} y^*(0)\\ y^*(1)\\ y^*(2)\\ y^*(3)\\ y^*(4)\\ y^*(5)\\ y^*(6)\\ 0\\ 0\end{bmatrix}
-
\begin{bmatrix}
x^*(0) & 0 & 0\\
x^*(1) & x^*(0) & 0\\
x^*(2) & x^*(1) & x^*(0)\\
x^*(3) & x^*(2) & x^*(1)\\
x^*(4) & x^*(3) & x^*(2)\\
x^*(5) & x^*(4) & x^*(3)\\
x^*(6) & x^*(5) & x^*(4)\\
0 & x^*(6) & x^*(5)\\
0 & 0 & x^*(6)
\end{bmatrix}
\begin{bmatrix} c_0\\ c_1\\ c_2\end{bmatrix} \tag{7.3.4}$$

where the rows correspond to $n = 0, 1, \ldots, 8$, with $n = 0$, $n = M-1 = 2$, $n = N-1 = 6$, and $n = N+M-2 = 8$ marking the boundaries of the possible summation ranges. In general,

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{c} \tag{7.3.5}$$

where the exact form of $\mathbf{e}$, $\mathbf{y}$, and $\mathbf{X}$ depends on the range $N_i \le n \le N_f$ of measurements to be used, which in turn determines the range of summation

$$E = \sum_{n=N_i}^{N_f}|e(n)|^2 = \mathbf{e}^H\mathbf{e} \tag{7.3.6}$$

in the LS criterion. The LS FIR filter is found by solving the LS normal equations

$$(\mathbf{X}^H\mathbf{X})\,\mathbf{c}_{ls} = \mathbf{X}^H\mathbf{y} \tag{7.3.7}$$

or

$$\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{7.3.8}$$

with an LS error of

$$E_{ls} = E_y - \hat{\mathbf{d}}^H\mathbf{c}_{ls} \tag{7.3.9}$$

where $E_y$ is the energy of the desired response signal. The elements of the time-average correlation matrix $\hat{\mathbf{R}}$ are given by

$$\hat{r}_{ij} = \tilde{\mathbf{x}}_i^H\tilde{\mathbf{x}}_j = \sum_{n=N_i}^{N_f}x(n+1-i)\,x^*(n+1-j) \qquad 1 \le i, j \le M \tag{7.3.10}$$

where $\tilde{\mathbf{x}}_i$ are the columns of data matrix $\mathbf{X}$. A simple manipulation of (7.3.10) leads to

$$\hat{r}_{i+1,j+1} = \hat{r}_{ij} + x(N_i-i)\,x^*(N_i-j) - x(N_f+1-i)\,x^*(N_f+1-j) \qquad 1 \le i, j < M \tag{7.3.11}$$

which relates the elements of matrix $\hat{\mathbf{R}}$ that are located on the same diagonal. This property holds because the columns of $\mathbf{X}$ are obtained by shifting the first column. The recursion in (7.3.11) suggests the following way of efficiently computing $\hat{\mathbf{R}}$:
1. Compute the first row of $\hat{\mathbf{R}}$ by using (7.3.10). This requires $M$ dot products and a total of about $M(N_f-N_i)$ operations.
2. Compute the remaining elements in the upper triangular part of $\hat{\mathbf{R}}$, using (7.3.11). The required number of operations is proportional to $M^2$.
3. Compute the lower triangular part of $\hat{\mathbf{R}}$, using the Hermitian symmetry relation $\hat{r}_{ji} = \hat{r}_{ij}^*$.
Notice that direct computation of the upper triangular part of $\hat{\mathbf{R}}$ using (7.3.10), that is, without the recursion, requires approximately $M^2N/2$ operations, which increases significantly for moderate or large values of $M$.
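
The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the book's `lsmatvec` function: the names `sample`, `corr_direct`, and `corr_recursive` are hypothetical, indices are zero-based (so element `r[i][j]` corresponds to $\hat{r}_{i+1,j+1}$), and samples outside the record are assumed to be zero.

```python
import random

def sample(x, n):
    """x(n), with samples outside the record 0..len(x)-1 taken as zero."""
    return x[n] if 0 <= n < len(x) else 0j

def corr_direct(x, M, Ni, Nf):
    """Every element from the defining sum, zero-based form of (7.3.10)."""
    return [[sum(sample(x, n - i) * sample(x, n - j).conjugate()
                 for n in range(Ni, Nf + 1))
             for j in range(M)] for i in range(M)]

def corr_recursive(x, M, Ni, Nf):
    """First row from the sum, remaining elements from the recursion (7.3.11)."""
    r = [[0j] * M for _ in range(M)]
    for j in range(M):                      # step 1: first row
        r[0][j] = sum(sample(x, n) * sample(x, n - j).conjugate()
                      for n in range(Ni, Nf + 1))
    for i in range(M - 1):                  # step 2: rest of the upper triangle
        for j in range(i, M - 1):
            r[i + 1][j + 1] = (
                r[i][j]
                + sample(x, Ni - 1 - i) * sample(x, Ni - 1 - j).conjugate()
                - sample(x, Nf - i) * sample(x, Nf - j).conjugate())
    for i in range(M):                      # step 3: Hermitian symmetry
        for j in range(i):
            r[i][j] = r[j][i].conjugate()
    return r

# Complex test record and the no-windowing range Ni = M-1, Nf = N-1.
rng = random.Random(0)
N, M = 50, 4
x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]
Rd = corr_direct(x, M, M - 1, N - 1)
Rr = corr_recursive(x, M, M - 1, N - 1)
max_diff = max(abs(Rd[i][j] - Rr[i][j]) for i in range(M) for j in range(M))
print("largest difference between the two computations:", max_diff)
```

Both routines should agree to within rounding error for any summation range, since the recursion is an exact rearrangement of the defining sum.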
There are four ways to select the summation range $N_i \le n \le N_f$ that are used in LS filtering and prediction:


No windowing. If we set $N_i = M-1$ and $N_f = N-1$, we only use the available data and there are no distortions caused by forcing the data at the borders to artificial values.


Prewindowing. This corresponds to $N_i = 0$ and $N_f = N-1$ and is equivalent to setting the samples $x(-1), \ldots, x(-M+1)$ equal to zero. As a result, the term $x(N_i-i)\,x^*(N_i-j)$ does not appear in (7.3.11). This method is widely used in LS adaptive filtering.


Postwindowing. This corresponds to $N_i = M-1$ and $N_f = N+M-2$ and is equivalent to setting the samples $x(N), \ldots, x(N+M-2)$ equal to zero. As a result, the term $x(N_f+1-i)\,x^*(N_f+1-j)$ does not appear in (7.3.11). This method is not used very often for practical applications without prewindowing.


Full windowing. In this method, we impose both prewindowing and postwindowing (full windowing) to the input data and postwindowing to the desired response. The range of summation is from $N_i = 0$ to $N_f = N+M-2$, and as a result of full windowing, Eq. (7.3.11) becomes $\hat{r}_{i+1,j+1} = \hat{r}_{ij}$. Therefore, the elements $\hat{r}_{ij}$ depend on $i-j$, and matrix $\hat{\mathbf{R}}$ is Toeplitz. In this case, the normal equations (7.2.12) can be obtained from the Wiener-Hopf equations (5.4.11) by replacing the theoretical autocorrelations with their estimated values (see Section 4.2).

Clearly, as $N \gg M$ the performance difference between the various methods becomes insignificant. The no-windowing and full-windowing methods are known in the signal processing literature as the covariance and autocorrelation methods, respectively (Makhoul 1975b). We avoid these terms because they can lead to misleading statistical interpretations. We notice that in the LS filtering problem, the data matrix $\mathbf{X}$ is Toeplitz and the normal equations matrix $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$ is the product of two Toeplitz matrices. However, $\hat{\mathbf{R}}$ is Toeplitz only in the full-windowing case when $\mathbf{X}$ is banded Toeplitz. In all other cases $\hat{\mathbf{R}}$ is close to Toeplitz in a sense made precise in Morf et al. (1977).
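
As a small numerical illustration of the four summation ranges, the following Python sketch builds $\hat{\mathbf{R}}$ for a short real-valued record and checks which choices give a Toeplitz matrix. The function names `ls_corr` and `is_toeplitz`, the method labels, and the test record are assumptions made for this sketch; it is not a substitute for the book's `lsmatvec`.

```python
def ls_corr(x, M, method):
    """M x M time-average correlation matrix for one summation range.

    Zero-based element r[i][j] = sum_{n=Ni}^{Nf} x(n-i) x(n-j) for real x,
    with samples outside 0..N-1 taken as zero.
    """
    N = len(x)
    ranges = {
        "none": (M - 1, N - 1),        # no windowing
        "pre": (0, N - 1),             # prewindowing
        "post": (M - 1, N + M - 2),    # postwindowing
        "full": (0, N + M - 2),        # full windowing
    }
    Ni, Nf = ranges[method]
    s = lambda n: x[n] if 0 <= n < N else 0.0
    return [[sum(s(n - i) * s(n - j) for n in range(Ni, Nf + 1))
             for j in range(M)] for i in range(M)]

def is_toeplitz(R, tol=1e-12):
    """True if every element equals its neighbor down the same diagonal."""
    M = len(R)
    return all(abs(R[i + 1][j + 1] - R[i][j]) <= tol
               for i in range(M - 1) for j in range(M - 1))

x = [1.0, 2.0, -1.0, 3.0, 0.5, -2.0, 1.5]   # N = 7 samples, as in (7.3.4)
M = 3
toeplitz = {m: is_toeplitz(ls_corr(x, M, m))
            for m in ("none", "pre", "post", "full")}
print(toeplitz)
```

For this record only the full-windowing matrix passes the Toeplitz check, in agreement with the discussion above.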

The matrix $\hat{\mathbf{R}}$ and vector $\hat{\mathbf{d}}$, for the various windowing methods, are computed by using the MATLAB function [R,d]=lsmatvec(x,M,method,y), which is based on (7.3.10) and (7.3.11). Then the LS filter is computed by cls=R\d. Figure 7.6 shows an FIR LSE filter operating in block processing mode.


[Block diagram. Labels: x(n), y(n), frame blocking, compute and solve normal equations, FIR filter, frame unblocking.]

FIGURE 7.6 
Block processing implementation of an FIR LSE filter. 


EXAMPLE 7.3.1 To illustrate the design of least-squares FIR filters, suppose that we have a set of measurements of $x(n)$ and $y(n)$ for $0 \le n \le N-1$ with $N = 100$ that have been generated by the difference equation

$$y(n) = 0.5x(n) + 0.5x(n-1) + v(n)$$

The input $x(n)$ and the additive noise $v(n)$ are uncorrelated processes from a normal (Gaussian) distribution with mean $E\{x(n)\} = E\{v(n)\} = 0$ and variance $\sigma_x^2 = \sigma_v^2 = 1$. Fitting the model

$$\hat{y}(n) = h(0)x(n) + h(1)x(n-1)$$

to the measurements with the no-windowing LS criterion, we obtain

$$\mathbf{c}_{ls} = \begin{bmatrix}0.5361\\ 0.5570\end{bmatrix} \qquad \hat{\sigma}_e^2 = 1.0419 \qquad \hat{\sigma}_e^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix}0.0073 & -0.0005\\ -0.0005 & 0.0071\end{bmatrix}$$

using (7.3.7), (7.3.9), (7.2.44), and (7.2.42). If the mean of the additive noise is nonzero, for example, if $E\{v(n)\} = 1$, we get

$$\mathbf{c}_{ls} = \begin{bmatrix}0.4889\\ 0.5258\end{bmatrix} \qquad \hat{\sigma}_e^2 = 1.8655 \qquad \hat{\sigma}_e^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix}0.0131 & -0.0009\\ -0.0009 & 0.0127\end{bmatrix}$$

which shows that the variance of the estimates, that is, the diagonal elements of $\hat{\sigma}_e^2\hat{\mathbf{R}}^{-1}$, increases significantly. Suppose now that the recording device introduces an outlier in the input data at $x(30) = 20$. The estimated LS model and its associated statistics are given by

$$\mathbf{c}_{ls} = \begin{bmatrix}0.1796\\ 0.1814\end{bmatrix} \qquad \hat{\sigma}_e^2 = 1.6270 \qquad \hat{\sigma}_e^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix}0.0030 & 0.0000\\ 0.0000 & 0.0030\end{bmatrix}$$

Similarly, when an outlier is present in the output data, for example, at $y(30) = 20$, then the LS model and its statistics are

$$\mathbf{c}_{ls} = \begin{bmatrix}0.6303\\ 0.4653\end{bmatrix} \qquad \hat{\sigma}_e^2 = 5.0979 \qquad \hat{\sigma}_e^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix}0.0357 & -0.0025\\ -0.0025 & 0.0347\end{bmatrix}$$

In general, LS estimates are very sensitive to colored additive noise and outliers (Ljung 1987). Note that all the LS solutions in this example were produced with one sample realization $x(n)$ and that the results will vary for any other realizations.
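
An experiment of this kind can be sketched in Python as follows. This is a rough reproduction under stated assumptions, not the book's code: the functions `ls_fit` and `simulate` are hypothetical, the random generator and seed differ from whatever produced the numbers above, and the variance estimate divides $E_{ls}$ by the number of equations minus the number of coefficients. The printed values will therefore differ from those in the example while showing the same qualitative behavior.

```python
import random

def ls_fit(x, y):
    """No-windowing LS fit of y(n) ~ c0*x(n) + c1*x(n-1) for real data."""
    rows = [(x[n], x[n - 1], y[n]) for n in range(1, len(x))]
    r00 = sum(a * a for a, b, t in rows)
    r01 = sum(a * b for a, b, t in rows)
    r11 = sum(b * b for a, b, t in rows)
    d0 = sum(a * t for a, b, t in rows)
    d1 = sum(b * t for a, b, t in rows)
    det = r00 * r11 - r01 * r01
    c0 = (r11 * d0 - r01 * d1) / det          # solve the 2 x 2 normal equations
    c1 = (r00 * d1 - r01 * d0) / det
    e_ls = sum((t - c0 * a - c1 * b) ** 2 for a, b, t in rows)
    sigma2 = e_ls / (len(rows) - 2)           # error variance estimate
    cov = [[sigma2 * r11 / det, -sigma2 * r01 / det],
           [-sigma2 * r01 / det, sigma2 * r00 / det]]   # sigma2 * inverse(R)
    return (c0, c1), sigma2, cov

def simulate(N, seed, y_outlier=None):
    """Generate y(n) = 0.5x(n) + 0.5x(n-1) + v(n) and fit the two-tap model."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    y = [0.5 * x[n] + 0.5 * (x[n - 1] if n > 0 else 0.0) + v[n]
         for n in range(N)]
    if y_outlier is not None:
        y[30] = y_outlier                     # outlier in the output data
    return ls_fit(x, y)

c, s2, cov = simulate(100, seed=1)
c_out, s2_out, cov_out = simulate(100, seed=1, y_outlier=20.0)
print("clean data:  ", c, s2)
print("with outlier:", c_out, s2_out)
```

A single outlier in $y(n)$ inflates the estimated error variance, and with it the estimated coefficient variances, which is the sensitivity the example points out.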


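LS inverse filters. Given a causal filter with impulse response $g(n)$, its inverse filter $h(n)$ is specified by $g(n) * h(n) = \delta(n-n_0)$, $n_0 \ge 0$. We focus on causal inverse filters, which are often infinite impulse response (IIR), and we wish to approximate them by some FIR filter $c_{ls}(n) = h^*(n)$ that is optimum according to the LS criterion. In this case, the actual impulse response $g(n) * c_{ls}^*(n)$ of the combined system deviates from the desired response $\delta(n-n_0)$, resulting in an error $e(n)$. The convolution equation

$$e(n) = \delta(n-n_0) - \sum_{k=0}^{M}c_{ls}^*(k)\,g(n-k) \tag{7.3.12}$$

can be formulated in matrix form as follows for $M = 2$ and $N = 6$

$$\begin{bmatrix} e^*(0)\\ e^*(1)\\ e^*(2)\\ e^*(3)\\ e^*(4)\\ e^*(5)\\ e^*(6)\\ e^*(7)\\ e^*(8)\end{bmatrix}
=
\begin{bmatrix} 1\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0\end{bmatrix}
-
\begin{bmatrix}
g^*(0) & 0 & 0\\
g^*(1) & g^*(0) & 0\\
g^*(2) & g^*(1) & g^*(0)\\
g^*(3) & g^*(2) & g^*(1)\\
g^*(4) & g^*(3) & g^*(2)\\
g^*(5) & g^*(4) & g^*(3)\\
g^*(6) & g^*(5) & g^*(4)\\
0 & g^*(6) & g^*(5)\\
0 & 0 & g^*(6)
\end{bmatrix}
\begin{bmatrix} c_{ls}(0)\\ c_{ls}(1)\\ c_{ls}(2)\end{bmatrix}$$

assuming that $n_0 = 0$. In general,

$$\mathbf{e} = \boldsymbol{\delta}_i - \mathbf{G}\mathbf{c}_{ls} \tag{7.3.13}$$

where $\boldsymbol{\delta}_i$ is a vector whose $i$th element is 1 and whose remaining elements are all zero. The LS inverse filter and the corresponding error are given by

$$(\mathbf{G}^H\mathbf{G})\,\mathbf{c}_{ls}^{(i)} = \mathbf{G}^H\boldsymbol{\delta}_i \tag{7.3.14}$$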
and

$$E_{ls}^{(i)} = 1 - \boldsymbol{\delta}_i^H\mathbf{G}\mathbf{c}_{ls}^{(i)} \qquad 0 \le i \le M+N \tag{7.3.15}$$
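Using the projection operators (7.2.29) and (7.2.30), we can express the LS error as

$$E_{ls}^{(i)} = \boldsymbol{\delta}_i^H(\mathbf{P}-\mathbf{I})^H(\mathbf{P}-\mathbf{I})\boldsymbol{\delta}_i \tag{7.3.16}$$

where

$$\mathbf{P} = \mathbf{G}(\mathbf{G}^H\mathbf{G})^{-1}\mathbf{G}^H \tag{7.3.17}$$

The total error for all possible delays $0 \le i \le N+M$ can be written as

$$E_{total} = \sum_{i=0}^{N+M}E_{ls}^{(i)} = \mathrm{tr}[\mathbf{D}^H(\mathbf{P}-\mathbf{I})^H(\mathbf{P}-\mathbf{I})\mathbf{D}] \tag{7.3.18}$$

where

$$\mathbf{D} = [\boldsymbol{\delta}_0\;\; \boldsymbol{\delta}_1\;\; \boldsymbol{\delta}_2\;\cdots\; \boldsymbol{\delta}_{N+M}] = \mathbf{I} \tag{7.3.19}$$

is the $(N+M+1)\times(N+M+1)$ identity matrix. Since $\mathbf{D} = \mathbf{I}$, $\mathbf{P} = \mathbf{P}^H$, and $\mathbf{P}^2 = \mathbf{P}$, we obtain

$$E_{total} = \mathrm{tr}[\mathbf{D}^H(\mathbf{P}-\mathbf{I})^H(\mathbf{P}-\mathbf{I})\mathbf{D}] = \mathrm{tr}(\mathbf{I}-\mathbf{P}) = \mathrm{tr}(\mathbf{I}) - \mathrm{tr}(\mathbf{P})$$

or

$$E_{total} = N \tag{7.3.20}$$

because $\mathrm{tr}(\mathbf{I}) = N+M+1$ and

$$\mathrm{tr}(\mathbf{P}) = \mathrm{tr}[\mathbf{G}(\mathbf{G}^H\mathbf{G})^{-1}\mathbf{G}^H] = \mathrm{tr}[\mathbf{G}^H\mathbf{G}(\mathbf{G}^H\mathbf{G})^{-1}] = M+1$$

Hence, $E_{total}$ depends on the length $N+1$ of the filter $g(n)$ and is independent of the length $M+1$ of the inverse filter $c_{ls}(n)$. If the minimum $E_{ls}^{(i)}$, for a given $N$, occurs at delay $i = i_0$, we have

$$E_{ls}^{(i_0)} \le \frac{N}{N+M+1} \tag{7.3.21}$$

which shows that $E_{ls}^{(i_0)} \to 0$ as $M \to \infty$ (Claerbout and Robinson 1963).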
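EXAMPLE 7.3.2 Suppose that $g(n) = \delta(n) - \alpha\delta(n-1)$, where $\alpha$ is a real constant. The exact inverse filter is

$$H(z) = \frac{1}{1-\alpha z^{-1}} \;\leftrightarrow\; h(n) = \alpha^n u(n)$$

and is minimum-phase only if $-1 < \alpha < 1$. The inverse LS filter for $M = 1$ and $N > 2$ is obtained by applying (7.3.14) with

$$\mathbf{G} = \begin{bmatrix}1 & 0\\ -\alpha & 1\\ 0 & -\alpha\end{bmatrix} \quad\text{and}\quad \boldsymbol{\delta} = \begin{bmatrix}1\\ 0\\ 0\end{bmatrix}$$

The normal equations are

$$\begin{bmatrix}1+\alpha^2 & -\alpha\\ -\alpha & 1+\alpha^2\end{bmatrix}\begin{bmatrix}c_{ls}(0)\\ c_{ls}(1)\end{bmatrix} = \begin{bmatrix}1\\ 0\end{bmatrix} \tag{7.3.22}$$

leading to the LS inverse filter

$$c_{ls}(0) = \frac{1+\alpha^2}{1+\alpha^2+\alpha^4} \qquad c_{ls}(1) = \frac{\alpha}{1+\alpha^2+\alpha^4}$$

with LS error

$$E_{ls} = 1 - c_{ls}(0) = \frac{\alpha^4}{1+\alpha^2+\alpha^4}$$

The system function of the LS inverse filter is

$$H_{ls}(z) = \frac{1+\alpha^2}{1+\alpha^2+\alpha^4}\left(1 + \frac{\alpha}{1+\alpha^2}z^{-1}\right)$$

and has a zero at $z_1 = -\alpha/(1+\alpha^2) = -1/(\alpha+\alpha^{-1})$. Since $|z_1| < 1$ for any value of $\alpha$, the LS inverse filter is minimum-phase even if $g(n)$ is not. This stems from the fact that the normal equations (7.3.22) specify a one-step forward linear predictor with a correlation matrix that is Toeplitz and positive definite for any value of $\alpha$ (see Section 6.4).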
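
The closed-form results of this example are easy to check numerically. The Python sketch below forms $\mathbf{G}$ and $\boldsymbol{\delta}$ for a given real $\alpha$, solves the $2\times2$ normal equations, and returns the filter, the LS error, and the zero location; the function name `ls_inverse_first_order` is a hypothetical helper introduced only for this illustration.

```python
def ls_inverse_first_order(alpha):
    """Two-tap LS inverse of g(n) = delta(n) - alpha*delta(n-1), delay 0, real alpha."""
    G = [[1.0, 0.0], [-alpha, 1.0], [0.0, -alpha]]
    delta = [1.0, 0.0, 0.0]
    # Normal equations (G^T G) c = G^T delta for real data.
    r = [[sum(G[n][i] * G[n][j] for n in range(3)) for j in range(2)]
         for i in range(2)]
    d = [sum(G[n][i] * delta[n] for n in range(3)) for i in range(2)]
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    c0 = (r[1][1] * d[0] - r[0][1] * d[1]) / det
    c1 = (r[0][0] * d[1] - r[1][0] * d[0]) / det
    e_ls = 1.0 - (d[0] * c0 + d[1] * c1)      # E_ls = E_y - d^T c with E_y = 1
    zero = -c1 / c0                           # zero of c0 + c1 z^{-1}
    return c0, c1, e_ls, zero

for alpha in (0.5, 2.0, -3.0):
    c0, c1, e_ls, zero = ls_inverse_first_order(alpha)
    print(alpha, c0, c1, e_ls, zero)
```

For $\alpha = 2$ the filter $g(n)$ is not minimum-phase, yet the zero of the LS inverse filter is at $-\alpha/(1+\alpha^2) = -0.4$, inside the unit circle.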


7.4 Linear Least-Squares Signal Estimation 


We now discuss the application of the LS method to general signal estimation, FLP, BLP, and combined forward and 
backward linear prediction. The reader is advised to review Section 5.5, which provides a detailed discussion of the 
same problems for the MMSE criterion. The presentation in this section closely follows the viewpoint and notation in 
Section 5.5. 


7.4.1 Signal Estimation and Linear Prediction 


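Suppose that we wish to compute the linear LS signal estimator $\mathbf{c}_{ls}^{(i)}$ defined by

$$e^{(i)}(n) = \sum_{k=0}^{M}c_k^{(i)*}x(n-k) = \mathbf{c}^{(i)H}\bar{\mathbf{x}}(n) \qquad \text{with } c_i^{(i)} \triangleq 1 \tag{7.4.1}$$

from the data $x(n)$, $0 \le n \le N-1$. Using (7.4.1) and following the process that led to (7.3.4), we obtain

$$\mathbf{e}^{(i)} = \bar{\mathbf{X}}\mathbf{c}^{(i)} \tag{7.4.2}$$

where

$$\bar{\mathbf{X}} = \begin{bmatrix}
x^*(0) & 0 & \cdots & 0\\
x^*(1) & x^*(0) & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
x^*(M) & x^*(M-1) & \cdots & x^*(0)\\
\vdots & \vdots & & \vdots\\
x^*(N-1) & x^*(N-2) & \cdots & x^*(N-M-1)\\
0 & x^*(N-1) & \cdots & x^*(N-M)\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & x^*(N-1)
\end{bmatrix} \tag{7.4.3}$$

is the combined data and desired response matrix with all the unavailable samples set equal to zero (full windowing). Matrix $\bar{\mathbf{X}}$ can be partitioned columnwise as

$$\bar{\mathbf{X}} = [\mathbf{X}_1\;\; \mathbf{y}\;\; \mathbf{X}_2] \tag{7.4.4}$$

where $\mathbf{y}$, the desired response, is the $i$th column of $\bar{\mathbf{X}}$. Using (7.4.4), we can easily show that the LS signal estimator $\mathbf{c}_{ls}^{(i)}$ and the associated LS error $E_{ls}^{(i)}$ are determined by

$$(\bar{\mathbf{X}}^H\bar{\mathbf{X}})\,\mathbf{c}_{ls}^{(i)} = \begin{bmatrix}\mathbf{0}\\ E_{ls}^{(i)}\\ \mathbf{0}\end{bmatrix} \tag{7.4.5}$$

where $E_{ls}^{(i)}$ is the $i$th element of the right-hand side vector (see Problem 7.3). If we define the time-average correlation matrix

$$\hat{\bar{\mathbf{R}}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} \tag{7.4.6}$$

and use the augmented normal equations in (7.4.5), we obtain a set of equations that have the same form as (5.5.12), the equations for the MMSE signal estimator. Therefore, after we have computed $\hat{\bar{\mathbf{R}}}$, using the command Rbar=lsmatvec(x,M+1,method), we can use the steps in Table 5.3 to compute the LS forward linear predictor (FLP), the backward linear predictor (BLP), the symmetric smoother, or any other signal estimator with delay $i$. Again, we use the standard notation $E_{ls}^{(0)} = E^f$ and $\mathbf{c}_{ls}^{(0)} = \mathbf{a}$ for the FLP and $E_{ls}^{(M)} = E^b$ and $\mathbf{c}_{ls}^{(M)} = \mathbf{b}$ for the BLP.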

All formulas given in Section 5.5 hold for LS signal estimators if the matrix $\bar{\mathbf{R}}(n)$ is replaced by $\hat{\bar{\mathbf{R}}}$. However, we stress that although the optimum MMSE signal estimator $\mathbf{c}_o^{(i)}(n)$ is a deterministic vector, the LS signal estimator $\mathbf{c}_{ls}^{(i)}$ is a random vector that is a function of the random measurements $x(n)$, $0 \le n \le N-1$. In the full-windowing case, matrix $\hat{\bar{\mathbf{R}}}$ is Toeplitz; if it is also positive definite, then the FLP is minimum-phase. Although the use of full windowing leads to these nice properties, it also creates some "edge effects" and bias in the estimates because we try to estimate some signal values using values that are not part of the signal by forcing the samples leading and lagging the available data measurements to zero.


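EXAMPLE 7.4.1. Suppose that we are given the signal segment $x(n) = \alpha^n$, $0 \le n \le N$, where $\alpha$ is an arbitrary complex-valued constant. Determine the first-order one-step forward linear predictor, using the full-windowing and no-windowing methods.

Solution. We start by forming the combined desired response and data matrix

$$\bar{\mathbf{X}}^H = \begin{bmatrix}x(0) & x(1) & \cdots & x(N) & 0\\ 0 & x(0) & \cdots & x(N-1) & x(N)\end{bmatrix}$$

For the full-windowing method, the matrix

$$\hat{\bar{\mathbf{R}}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} = \begin{bmatrix}\hat{r}_x(0) & \hat{r}_x(1)\\ \hat{r}_x^*(1) & \hat{r}_x(0)\end{bmatrix}$$

is Toeplitz with elements

$$\hat{r}_x(0) = \sum_{n=0}^{N}|x(n)|^2 = \sum_{n=0}^{N}|\alpha|^{2n} = \frac{1-|\alpha|^{2(N+1)}}{1-|\alpha|^2}$$

$$\hat{r}_x(1) = \sum_{n=1}^{N}x(n)\,x^*(n-1) = \sum_{n=1}^{N}\alpha^n(\alpha^*)^{n-1} = \alpha\,\frac{1-|\alpha|^{2N}}{1-|\alpha|^2}$$

Therefore, we have

$$\begin{bmatrix}\hat{r}_x(0) & \hat{r}_x(1)\\ \hat{r}_x^*(1) & \hat{r}_x(0)\end{bmatrix}\begin{bmatrix}1\\ a_1^{(1)}\end{bmatrix} = \begin{bmatrix}E_1^f\\ 0\end{bmatrix}$$

whose solution gives

$$a_1^{(1)} = -\frac{\hat{r}_x^*(1)}{\hat{r}_x(0)} = -\alpha^*\,\frac{1-|\alpha|^{2N}}{1-|\alpha|^{2(N+1)}}$$

$$E_1^f = \hat{r}_x(0) + \hat{r}_x(1)\,a_1^{(1)} = \frac{1-|\alpha|^{2(2N+1)}}{1-|\alpha|^{2(N+1)}}$$

Since for every sequence $|\hat{r}_x(l)| \le \hat{r}_x(0)$, we have $|a_1^{(1)}| \le 1$; that is, the obtained prediction error filter always is minimum-phase. Furthermore, if $|\alpha| < 1$, then $\lim_{N\to\infty}a_1^{(1)} = -\alpha^*$ and $\lim_{N\to\infty}E_1^f = 1 = |x(0)|^2$. In the no-windowing case, the matrix

$$\hat{\bar{\mathbf{R}}} = \begin{bmatrix}\hat{r}_{11} & \hat{r}_{12}\\ \hat{r}_{21} & \hat{r}_{22}\end{bmatrix}$$

is Hermitian but not Toeplitz with elements

$$\hat{r}_{11} = \sum_{n=1}^{N}|x(n)|^2 = |\alpha|^2\,\frac{1-|\alpha|^{2N}}{1-|\alpha|^2} \qquad \hat{r}_{22} = \sum_{n=0}^{N-1}|x(n)|^2 = \frac{1-|\alpha|^{2N}}{1-|\alpha|^2}$$

$$\hat{r}_{12} = \hat{r}_{21}^* = \sum_{n=1}^{N}x(n)\,x^*(n-1) = \alpha\,\frac{1-|\alpha|^{2N}}{1-|\alpha|^2}$$

Solving the linear system

$$\begin{bmatrix}\hat{r}_{11} & \hat{r}_{12}\\ \hat{r}_{21} & \hat{r}_{22}\end{bmatrix}\begin{bmatrix}1\\ a_1^{(1)}\end{bmatrix} = \begin{bmatrix}E_1^f\\ 0\end{bmatrix}$$

we obtain

$$a_1^{(1)} = -\frac{\hat{r}_{21}}{\hat{r}_{22}} = -\alpha^*$$

and

$$E_1^f = \hat{r}_{11} + \hat{r}_{12}\,a_1^{(1)} = 0$$

We see that the no-windowing method provides a perfect linear predictor because there is no distortion due to windowing. However, the obtained prediction error filter is minimum-phase only when $|\alpha| < 1$.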
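
The two cases of this example can be checked numerically with the short Python sketch below. The helper `first_order_flp` and its method labels are hypothetical names chosen for this illustration; the prediction error is taken as $e(n) = x(n) + a^*x(n-1)$, which is the convention assumed here for the coefficient $a_1^{(1)}$.

```python
import cmath

def first_order_flp(x, method):
    """First-order LS forward predictor for a complex record x(0..N).

    Returns (a, Ef) from the 2 x 2 augmented normal equations, with the
    prediction error defined as e(n) = x(n) + conj(a) * x(n-1).
    """
    n1 = len(x)
    s = lambda n: x[n] if 0 <= n < n1 else 0j   # zero outside the record
    # Full windowing uses n = 0..N+1; no windowing uses n = 1..N.
    n_range = range(0, n1 + 1) if method == "full" else range(1, n1)
    r11 = sum(abs(s(n)) ** 2 for n in n_range)
    r22 = sum(abs(s(n - 1)) ** 2 for n in n_range)
    r12 = sum(s(n) * s(n - 1).conjugate() for n in n_range)
    a = -r12.conjugate() / r22
    ef = r11 - abs(r12) ** 2 / r22              # equals r11 + r12 * a
    return a, ef

alpha = 1.2 * cmath.exp(0.4j)                   # |alpha| > 1 on purpose
N = 20
x = [alpha ** n for n in range(N + 1)]
a_full, ef_full = first_order_flp(x, "full")
a_none, ef_none = first_order_flp(x, "none")
print("full windowing: |a| =", abs(a_full), " Ef =", ef_full)
print("no windowing:   a + conj(alpha) =", a_none + alpha.conjugate(),
      " Ef =", ef_none)
```

With $|\alpha| > 1$ the no-windowing predictor is exact but has $|a_1^{(1)}| = |\alpha| > 1$, whereas the full-windowing predictor keeps $|a_1^{(1)}| < 1$ at the price of a nonzero error.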


EXAMPLE 7.4.2 To illustrate the statistical properties of least-squares FLP, we generate $K = 500$ realizations of the MA(1) process $x(n) = w(n) + \frac{1}{2}w(n-1)$, where $w(n) \sim \mathrm{WN}(0,1)$ (see Example 5.5.2). Each realization $x(\zeta_i, n)$ has duration $N = 100$ samples. We use these data to design an $M = 2$ order FLP, using the no-windowing LS method. The estimated mean and variance of the obtained $K$ FLP vectors are

$$\mathrm{Mean}\{\mathbf{a}(\zeta_i)\} = \begin{bmatrix}-0.4695\\ 0.1889\end{bmatrix} \quad\text{and}\quad \mathrm{var}\{\mathbf{a}(\zeta_i)\} = \begin{bmatrix}0.0086\\ 0.0092\end{bmatrix}$$

whereas the average of the variances $\hat{\sigma}_e^2$ is 0.9848. We notice that both means are close to the theoretical values obtained in Example 5.5.2. The covariance matrix of a given LS estimate $\mathbf{a}_{ls}$ was found to be

$$\hat{\sigma}_e^2\hat{\mathbf{R}}^{-1} = \begin{bmatrix}0.0099 & -0.0043\\ -0.0043 & 0.0099\end{bmatrix}$$

whose diagonal elements are close to the components of $\mathrm{var}\{\mathbf{a}\}$, as expected. The bias in the estimate $\mathbf{a}_{ls}$ results from the fact that the residuals in the LS equations are correlated with each other (see Problem 7.4).


7.4.2 Combined Forward and Backward Linear Prediction (FBLP) 


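For stationary stochastic processes, the optimum MMSE forward and backward linear predictors have even conjugate symmetry, that is,

$$\mathbf{a}_o = \mathbf{J}\mathbf{b}_o^* \tag{7.4.7}$$

because both directions of time have the same second-order statistics. Formally, this property stems from the Toeplitz structure of the autocorrelation matrix (see Section 5.5). However, we could possibly improve performance by minimizing the total forward and backward squared error

$$E^{fb} = \sum_{n=N_i}^{N_f}\{|e^f(n)|^2 + |e^b(n)|^2\} = (\mathbf{e}^f)^H\mathbf{e}^f + (\mathbf{e}^b)^H\mathbf{e}^b \tag{7.4.8}$$

under the constraint

$$\mathbf{a}^{fb} \triangleq \mathbf{a} = \mathbf{J}\mathbf{b}^* \tag{7.4.9}$$

The FLP and BLP overdetermined sets of equations are

$$\mathbf{e}^f = \bar{\mathbf{X}}\begin{bmatrix}1\\ \mathbf{a}\end{bmatrix} \quad\text{and}\quad \mathbf{e}^b = \bar{\mathbf{X}}\begin{bmatrix}\mathbf{b}\\ 1\end{bmatrix} \tag{7.4.10}$$

or

$$\mathbf{e}^f = \bar{\mathbf{X}}\begin{bmatrix}1\\ \mathbf{a}^{fb}\end{bmatrix} \quad\text{and}\quad \mathbf{e}^{b*} = \bar{\mathbf{X}}^*\mathbf{J}\begin{bmatrix}1\\ \mathbf{a}^{fb}\end{bmatrix} \tag{7.4.11}$$

where we have used (7.4.9) and the property $\mathbf{J}\mathbf{J} = \mathbf{I}$ of the exchange matrix. If we combine the above two equations as

$$\begin{bmatrix}\mathbf{e}^f\\ \mathbf{e}^{b*}\end{bmatrix} = \begin{bmatrix}\bar{\mathbf{X}}\\ \bar{\mathbf{X}}^*\mathbf{J}\end{bmatrix}\begin{bmatrix}1\\ \mathbf{a}^{fb}\end{bmatrix}$$

then the forward-backward linear predictor that minimizes $E^{fb}$ is given by (see Problem 7.5)

$$\begin{bmatrix}\bar{\mathbf{X}}\\ \bar{\mathbf{X}}^*\mathbf{J}\end{bmatrix}^H\begin{bmatrix}\bar{\mathbf{X}}\\ \bar{\mathbf{X}}^*\mathbf{J}\end{bmatrix}\begin{bmatrix}1\\ \mathbf{a}^{fb}\end{bmatrix} = \begin{bmatrix}E^{fb}\\ \mathbf{0}\end{bmatrix} \tag{7.4.12}$$

or

$$(\bar{\mathbf{X}}^H\bar{\mathbf{X}} + \mathbf{J}\bar{\mathbf{X}}^T\bar{\mathbf{X}}^*\mathbf{J})\begin{bmatrix}1\\ \mathbf{a}^{fb}\end{bmatrix} = \begin{bmatrix}E^{fb}\\ \mathbf{0}\end{bmatrix} \tag{7.4.13}$$

which can be solved by using the steps described in Table 5.3. The time-average forward-backward correlation matrix

$$\hat{\mathbf{R}}_{fb} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} + \mathbf{J}\bar{\mathbf{X}}^T\bar{\mathbf{X}}^*\mathbf{J} \tag{7.4.14}$$

with elements

$$\hat{r}_{ij}^{fb} = \hat{r}_{ij} + \hat{r}_{M-i,M-j}^* \qquad 0 \le i, j \le M \tag{7.4.15}$$

is persymmetric; that is, $\mathbf{J}\hat{\mathbf{R}}_{fb}\mathbf{J} = \hat{\mathbf{R}}_{fb}^*$, and its elements are conjugate symmetric about both main diagonals. In MATLAB we compute $\hat{\mathbf{R}}_{fb}$ by these commands:
Rbar=lsmatvec(x,M+1,method)
Rfb=Rbar+flipud(fliplr(conj(Rbar)))
The FBLP method is used with no windowing and was originally introduced independently by Ulrych and Clayton (1976) and Nuttall (1976) as a spectral estimation technique under the name modified covariance method (see Section 8.2). If we use full windowing, then $\mathbf{a}^{fb} = (\mathbf{a} + \mathbf{J}\mathbf{b}^*)/2$ (see Problem 7.6).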
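
The construction of the forward-backward matrix can be illustrated with a small Python sketch. It forms the no-windowing matrix $\bar{\mathbf{X}}^H\bar{\mathbf{X}}$ directly from its defining sum, adds the flipped and conjugated copy as in (7.4.15), and solves the first-order case for a complex sinusoid. The function names `corr_matrix` and `forward_backward` are hypothetical, and the test signal is an assumption made for this illustration.

```python
import cmath

def corr_matrix(x, M):
    """(M+1) x (M+1) no-windowing matrix: r[i][j] = sum_{n=M}^{N-1} x(n-i) x*(n-j)."""
    N = len(x)
    return [[sum(x[n - i] * x[n - j].conjugate() for n in range(M, N))
             for j in range(M + 1)] for i in range(M + 1)]

def forward_backward(R):
    """R + J conj(R) J, i.e., element (i, j) plus conj of element (M-i, M-j)."""
    K = len(R)
    return [[R[i][j] + R[K - 1 - i][K - 1 - j].conjugate()
             for j in range(K)] for i in range(K)]

omega = 0.8
x = [cmath.exp(1j * omega * n) for n in range(32)]   # complex sinusoid
M = 1
Rbar = corr_matrix(x, M)
Rfb = forward_backward(Rbar)

# Persymmetry check: J Rfb J should equal conj(Rfb).
K = M + 1
persym_err = max(abs(Rfb[K - 1 - i][K - 1 - j] - Rfb[i][j].conjugate())
                 for i in range(K) for j in range(K))

# First-order FBLP from the second row of Rfb [1, a]^T = [E, 0]^T.
a_fb = -Rfb[1][0] / Rfb[1][1]
print("persymmetry error:", persym_err)
print("a_fb:", a_fb, " expected:", -cmath.exp(-1j * omega))
```

For a noise-free complex sinusoid the first-order predictor should satisfy $a^{fb} = -e^{-j\omega}$, so that the error $x(n) + a^{fb*}x(n-1)$ vanishes.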


7.4.3 Narrowband Interference Cancelation 


Several practical applications require the removal of narrowband interference (NBI) from a wideband desired signal 
corrupted by additive white noise. For example, ground and foliage-penetrating radars operate from 0.01 to 1 GHz 
and use either an impulse or a chirp waveform. To achieve high resolution, these waveforms are extremely wideband, 
occupying at least 100 MHz within the range of 0.01 to 1 GHz. However, these frequency ranges are extensively 
used by TV and FM stations, cellular phones, and other relatively narrowband (less than 1 MHz) radio-frequency (RF) 
sources. Clearly, these sources spoil the radar returns with narrowband RF interference (Miller et al. 1997). Since the 
additive noise is often due to the sensor circuitry, it will be referred to as sensor thermal noise. Next we provide a 
practical solution to this problem, using an LS linear predictor. Suppose that the corrupted signal $x(n)$ is given by

$$x(n) = s(n) + y(n) + v(n) \tag{7.4.16}$$

where

$$\begin{aligned} s(n) &= \text{signal of interest}\\ y(n) &= \text{narrowband interference}\\ v(n) &= \text{thermal (white) noise}\end{aligned} \tag{7.4.17}$$

are the individual components, assumed to be stationary stochastic processes.

We wish to design an NBI canceler that estimates and rejects the interference signal $y(n)$ from the signal $x(n)$, while preserving the signal of interest $s(n)$. Since signals $y(n)$ and $x(n)$ are correlated, we can form an estimate of the NBI using the optimum linear estimator

$$\hat{y}(n) = \mathbf{c}_o^H\mathbf{x}(n-D) \tag{7.4.18}$$

where

$$\mathbf{R}\mathbf{c}_o = \mathbf{d} \tag{7.4.19}$$

$$\mathbf{R} = E\{\mathbf{x}(n-D)\,\mathbf{x}^H(n-D)\} \tag{7.4.20}$$

$$\mathbf{d} = E\{\mathbf{x}(n-D)\,y^*(n)\} \tag{7.4.21}$$

and $D$ is an integer delay whose use will be justified shortly. Note that if $D = 1$, then (7.4.18) is the LS forward linear predictor. If $\hat{y}(n) = y(n)$, the output of the canceler is $x(n) - \hat{y}(n) = s(n) + v(n)$; that is, the NBI is completely excised, and the desired signal is corrupted by white noise only and is said to be thermal noise-limited.

Since, in practice, the required second-order moments are not available, we need to use an LS estimator instead. However, the quantity $\mathbf{X}^H\mathbf{y}$ in (7.2.21) requires the NBI signal $y(n)$, which is also not available. To overcome this obstacle, consider the optimum MMSE $D$-step forward linear predictor

$$e^f(n) = x(n) + \mathbf{a}_o^H\mathbf{x}(n-D) \tag{7.4.22}$$

$$\mathbf{R}\mathbf{a}_o = -\mathbf{r}^f \tag{7.4.23}$$

where $\mathbf{R}$ is given by (7.4.20) and

$$\mathbf{r}^f = E\{\mathbf{x}(n-D)\,x^*(n)\} \tag{7.4.24}$$

In many NBI cancelation applications, the components of the observed signal have the following properties:

1. The desired signal $s(n)$, the NBI $y(n)$, and the thermal noise $v(n)$ are mutually uncorrelated.

2. The thermal noise $v(n)$ is white; that is, $r_v(l) = \sigma_v^2\delta(l)$.

3. The desired signal $s(n)$ is wideband and therefore has a short correlation length; that is, $r_s(l) = 0$ for $|l| \ge D$.

4. The NBI has a long correlation length; that is, its autocorrelation takes significant values over the range $0 \le |l| \le M$ for $M > D$.

In practice, the second and third properties mean that the desired signal and the thermal noise are approximately uncorrelated after a certain small lag. These are precisely the properties exploited by the canceler to separate the NBI from the desired signal and the background noise.

As a result of the first assumption, we have

$$E\{x(n-k)\,y^*(n)\} = E\{y(n-k)\,y^*(n)\} = r_y(k) \qquad \text{for all } k \tag{7.4.25}$$

and

$$r_x(l) = r_s(l) + r_y(l) + r_v(l) \tag{7.4.26}$$

Making use of the second and third assumptions, we have

$$r_x(l) = r_y(l) \qquad \text{for } l \ne 0, 1, \ldots, D-1 \tag{7.4.27}$$

The exclusion of the lags $l = 0, 1, \ldots, D-1$ in $\mathbf{r}^f$ and $\mathbf{d}$ is critical, and we have arranged for that by forcing the filter and the predictor to form their estimates using the delayed data vector $\mathbf{x}(n-D)$. From (7.4.21), (7.4.24), and (7.4.27), we conclude that $\mathbf{d} = \mathbf{r}^f$ and therefore, from (7.4.19) and (7.4.23), $\mathbf{c}_o = -\mathbf{a}_o$. Thus, the optimum NBI estimator $\mathbf{c}_o$ is equal to the negative of the $D$-step linear predictor $\mathbf{a}_o$, which can be determined exclusively from the input signal $x(n)$. The cleaned signal is

$$x(n) - \hat{y}(n) = x(n) + \mathbf{a}_o^H\mathbf{x}(n-D) = e^f(n) \tag{7.4.28}$$

which is identical to the $D$-step forward prediction error. This leads to the linear prediction NBI canceler shown in Figure 7.7.
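
A toy version of the canceler in (7.4.28) can be sketched in Python. This is only an illustration under simplifying assumptions: the data are real-valued, the interference is a single sinusoid, the signal of interest is a one-sample pulse, and a plain forward predictor (not an FBLP) is fitted with the no-windowing LS criterion. The names `solve` and `nbi_canceler`, and all signal parameters, are hypothetical choices for this sketch.

```python
import math
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small real system."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= f * A[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (A[i][n] - sum(A[i][k] * z[k] for k in range(i + 1, n))) / A[i][i]
    return z

def nbi_canceler(x, M, D):
    """LS D-step predictor a and cleaned signal e(n) = x(n) + a^T x(n-D)."""
    N = len(x)
    start = D + M - 1                       # first n with all samples available
    vec = lambda n: [x[n - D - k] for k in range(M)]
    R = [[0.0] * M for _ in range(M)]
    r = [0.0] * M
    for n in range(start, N):
        u = vec(n)
        for i in range(M):
            r[i] += u[i] * x[n]
            for j in range(M):
                R[i][j] += u[i] * u[j]
    a = solve(R, [-ri for ri in r])         # time-average version of (7.4.23)
    e = [x[n] + sum(ak * uk for ak, uk in zip(a, vec(n)))
         for n in range(start, N)]
    return a, e, start

rng = random.Random(3)
N, M, D = 1000, 8, 1
s = [0.0] * N
s[500] = 10.0                                              # short pulse
y = [5.0 * math.cos(0.9 * n + 1.0) for n in range(N)]      # narrowband interference
v = [rng.gauss(0.0, 1.0) for _ in range(N)]                # white noise
x = [s[n] + y[n] + v[n] for n in range(N)]

a, e, start = nbi_canceler(x, M, D)
power = lambda u: sum(t * t for t in u) / len(u)
p_in, p_out = power(x[start:]), power(e)
p_nbi = power(y[start:])
p_resid = power([e[i] - s[start + i] - v[start + i] for i in range(len(e))])
print("input power:", p_in, " cleaned power:", p_out)
print("NBI power:", p_nbi, " residual NBI power:", p_resid)
```

The cleaned signal retains the pulse and the white noise, while most of the sinusoidal interference is removed; a longer predictor generally lowers the residual interference further.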

[Block diagram. Labels: corrupted signal x(n), forward linear predictor, cleaned signal.]

FIGURE 7.7 
Block diagram of linear prediction NBI canceler. 

To illustrate the performance of the linear prediction NBI canceler, we consider an impulse radar operating in a 
location with commercial radio and TV stations. The desired signal is a short-duration impulse corrupted by additive 
thermal noise and NBI (see Figure 7.8). The spectrum of the NBI is shown in Figure 7.9. We use a block of data 
(N =4096) to design an FBLP with D=1 and M =100 coefficients, using the LS criterion with no windowing. 
Then we compute the cleaned signal, using (7.4.28). The cleaned signal, its spectrum, and the magnitude response of 
the NBI canceler are shown in Figures 7.8 and 7.9. We see that the canceler acts as a notch filter that optimally puts 
notches at the peaks of the NBI. A detailed description of the design of optimum least-squares NBI cancelers is given 
in Problem 7.27.

FIGURE 7.8
NBI cancelation: time-domain results.

FIGURE 7.9
NBI cancelation: frequency-domain results (panel title: NBI canceler response, M = 100).


7.5 LS Computations Using the Normal Equations 


The solution of the normal equations for both MMSE and LSE estimation problems is computed by using the same 
algorithms. The key difference is that in MMSE estimation R and d are known, whereas in LSE estimation they 
need to be computed from the observed input and desired response signal samples. Therefore, it is natural to want to 
take advantage of the same algorithms developed for MMSE estimation in Chapter 6, whenever possible. However, 
keep in mind that despite algorithmic similarities, there are fundamental differences between the two classes of 
estimators that are dictated by the different nature of the criteria of performance (see Section 8.1). In this section,
we show how the computational algorithms and structures developed for linear MMSE estimation can be applied to 
linear LSE estimation, relying heavily on the material presented in Chapter 6. 


7.5.1 Linear LSE Estimation 


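The computation of a general linear LSE estimator requires the solution of a linear system

$$\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}} \tag{7.5.1}$$

where the time-average correlation matrix $\hat{\mathbf{R}}$ is Hermitian and positive definite [see (7.2.25)]. We can solve (7.5.1) by using the $\mathbf{LDL}^H$ or the Cholesky decomposition introduced in Section 5.3. The computation of linear LSE estimators involves the steps summarized in Table 7.1. We again stress that the major computational effort is involved in the computation of $\hat{\mathbf{R}}$ and $\hat{\mathbf{d}}$.

TABLE 7.1
Comparison between the $\mathbf{LDL}^H$ and Cholesky decomposition methods for the solution of normal equations.

Step   LDL^H decomposition                                                      Cholesky decomposition                                                   Description
1      $\hat{\mathbf{R}} = \mathbf{X}^H\mathbf{X}$, $\hat{\mathbf{d}} = \mathbf{X}^H\mathbf{y}$      (same)                                                                   Normal equations $\hat{\mathbf{R}}\mathbf{c}_{ls} = \hat{\mathbf{d}}$
2      $\hat{\mathbf{R}} = \mathbf{L}\mathbf{D}\mathbf{L}^H$                                        $\hat{\mathbf{R}} = \mathcal{L}\mathcal{L}^H$                            Triangular decomposition
3      $\mathbf{L}\mathbf{D}\mathbf{k} = \hat{\mathbf{d}}$                                          $\mathcal{L}\tilde{\mathbf{k}} = \hat{\mathbf{d}}$                       Forward substitution $\to \mathbf{k}$ or $\tilde{\mathbf{k}}$
4      $\mathbf{L}^H\mathbf{c}_{ls} = \mathbf{k}$                                                   $\mathcal{L}^H\mathbf{c}_{ls} = \tilde{\mathbf{k}}$                      Backward substitution $\to \mathbf{c}_{ls}$
5      $E_{ls} = E_y - \mathbf{k}^H\mathbf{D}\mathbf{k}$                                            $E_{ls} = E_y - \tilde{\mathbf{k}}^H\tilde{\mathbf{k}}$                  LSE computation
6      $\mathbf{e}_{ls} = \mathbf{y} - \mathbf{X}\mathbf{c}_{ls}$                                   $\mathbf{e}_{ls} = \mathbf{y} - \mathbf{X}\mathbf{c}_{ls}$               Computation of residuals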


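Steps 2 and 3 in (5.3.16) can be facilitated by a single extended $\mathbf{LDL}^H$ decomposition. To this end, we form the augmented data matrix

$$\bar{\mathbf{X}} = [\mathbf{X}\;\; \mathbf{y}] \tag{7.5.2}$$

and compute its time-average correlation matrix

$$\hat{\bar{\mathbf{R}}} = \bar{\mathbf{X}}^H\bar{\mathbf{X}} = \begin{bmatrix}\mathbf{X}^H\mathbf{X} & \mathbf{X}^H\mathbf{y}\\ \mathbf{y}^H\mathbf{X} & \mathbf{y}^H\mathbf{y}\end{bmatrix} = \begin{bmatrix}\hat{\mathbf{R}} & \hat{\mathbf{d}}\\ \hat{\mathbf{d}}^H & E_y\end{bmatrix} \tag{7.5.3}$$

We then can show (see Problem 7.9) that the $\mathbf{LDL}^H$ decomposition of $\hat{\bar{\mathbf{R}}}$ is given by

$$\hat{\bar{\mathbf{R}}} = \begin{bmatrix}\mathbf{L} & \mathbf{0}\\ \mathbf{k}^H & 1\end{bmatrix}\begin{bmatrix}\mathbf{D} & \mathbf{0}\\ \mathbf{0}^H & E_{ls}\end{bmatrix}\begin{bmatrix}\mathbf{L}^H & \mathbf{k}\\ \mathbf{0}^H & 1\end{bmatrix} \tag{7.5.4}$$

and thus provides the vector $\mathbf{k}$ and the LSE $E_{ls}$. Therefore, we can solve the normal equations (7.5.1), using the $\mathbf{LDL}^H$ decomposition of $\hat{\bar{\mathbf{R}}}$ to compute $\mathbf{L}$ and $\mathbf{k}$ and then solving $\mathbf{L}^H\mathbf{c}_{ls} = \mathbf{k}$ to compute $\mathbf{c}_{ls}$.


A careful inspection of the design equations for the general, $m$th-order, MMSE and LSE estimators, derived in Chapter 5 and summarized in Table 7.2, shows that the LSE equations can be obtained from the MMSE equations by replacing the linear operator $E\{\cdot\}$ by the linear operator $\sum_n(\cdot)$. As a result, all algorithms developed in Sections 6.1 and 6.2 can be used for linear LSE estimation problems.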


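TABLE 7.2
Comparison between the MMSE and LSE normal equations for general linear estimation.

                            MMSE                                                             LSE
Available information       $\mathbf{R}_m(n)$, $\mathbf{d}_m(n)$                              $\{\mathbf{x}_m(n),\, y(n),\; N_i \le n \le N_f\}$
Normal equations            $\mathbf{R}_m(n)\mathbf{c}_m(n) = \mathbf{d}_m(n)$                $\hat{\mathbf{R}}_m\mathbf{c}_m = \hat{\mathbf{d}}_m$
Minimum error               $P_m(n) = P_y(n) - \mathbf{d}_m^H(n)\mathbf{c}_m(n)$              $E_m = E_y - \hat{\mathbf{d}}_m^H\mathbf{c}_m$
Correlation matrix          $\mathbf{R}_m(n) = E\{\mathbf{x}_m(n)\mathbf{x}_m^H(n)\}$         $\hat{\mathbf{R}}_m = \mathbf{X}_m^H\mathbf{X}_m = \sum_n\mathbf{x}_m(n)\mathbf{x}_m^H(n)$
Cross-correlation vector    $\mathbf{d}_m(n) = E\{\mathbf{x}_m(n)y^*(n)\}$                    $\hat{\mathbf{d}}_m = \mathbf{X}_m^H\mathbf{y} = \sum_n\mathbf{x}_m(n)y^*(n)$
Power                       $P_y(n) = E\{|y(n)|^2\}$                                          $E_y = \mathbf{y}^H\mathbf{y} = \sum_n|y(n)|^2$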

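For example, we can easily see that $\hat{\mathbf{R}}_m$, $\hat{\mathbf{d}}_m$, $\mathbf{L}_m$, $\mathbf{D}_m$, and $\mathbf{k}_m$ have the optimum nesting property described in Section 6.1.1, that is, $\hat{\mathbf{R}}_m = \hat{\mathbf{R}}_{m+1}^{\lceil m\rceil}$ and so on. As a result, the factors of the $\mathbf{LDL}^H$ decomposition have the optimum nesting property, and we can obtain an order-recursive structure for the computation of the LSE estimate $\hat{y}_m(n)$. Indeed, if we define

$$\mathbf{w}_m(n) = \mathbf{L}_m^{-1}\mathbf{x}_m(n) \qquad 0 \le n \le N-1 \tag{7.5.5}$$

then

$$\hat{\mathbf{R}}_m = \sum_{n=0}^{N-1}\mathbf{x}_m(n)\mathbf{x}_m^H(n) = \mathbf{L}_m\left[\sum_{n=0}^{N-1}\mathbf{w}_m(n)\mathbf{w}_m^H(n)\right]\mathbf{L}_m^H \triangleq \mathbf{L}_m\mathbf{D}_m\mathbf{L}_m^H \tag{7.5.6}$$

where the matrix $\mathbf{D}_m$ is diagonal because the $\mathbf{LDL}^H$ decomposition is unique. If we define the record vectors

$$\tilde{\mathbf{w}}_j \triangleq [w_j(0)\;\; w_j(1)\;\cdots\; w_j(N-1)]^H \tag{7.5.7}$$

and the data matrix

$$\mathbf{W}_m \triangleq [\tilde{\mathbf{w}}_1\;\; \tilde{\mathbf{w}}_2\;\cdots\; \tilde{\mathbf{w}}_m] \tag{7.5.8}$$

then

$$\mathbf{D}_m = \mathbf{W}_m^H\mathbf{W}_m = \mathrm{diag}\{\xi_1, \xi_2, \ldots, \xi_m\} \tag{7.5.9}$$

where

$$\xi_i = \sum_{n=0}^{N-1}|w_i(n)|^2 = \tilde{\mathbf{w}}_i^H\tilde{\mathbf{w}}_i \tag{7.5.10}$$

From (7.5.9) we have

$$\tilde{\mathbf{w}}_i^H\tilde{\mathbf{w}}_j = 0 \qquad \text{for } i \ne j \tag{7.5.11}$$

that is, the columns of $\mathbf{W}_m$ are orthogonal and, in this sense, are the innovation vectors of the columns of data matrix $\mathbf{X}_m$, according to the LS interpretation of orthogonality introduced in Section 7.2.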
Following the approach in Section 6.1.5, we can show that the following order-recursive algorithm

$$\begin{aligned}
w_m(n) &= x_m(n) - \sum_{i=1}^{m-1}l_{i-1}^{(m-1)*}\,w_i(n)\\
\hat{y}_m(n) &= \hat{y}_{m-1}(n) + k_m^*\,w_m(n)\\
\text{or}\quad e_m(n) &= e_{m-1}(n) - k_m^*\,w_m(n)
\end{aligned} \tag{7.5.12}$$

computed for $n = 0, 1, \ldots, N-1$ and $m = 1, 2, \ldots, M$, provides the LSE estimates for orders $1 \le m \le M$.


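The statistical interpretations of innovation and partial correlation for $w_m(n)$ and $k_{m+1}$ hold now in a deterministic LSE sense. For example, the partial correlation between $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}_{m+1}$ is defined by using the residual records $\tilde{\mathbf{e}}_m = \tilde{\mathbf{y}} - \mathbf{X}_m\mathbf{c}_m$ and $\tilde{\mathbf{e}}_m^b = \tilde{\mathbf{x}}_{m+1} + \mathbf{X}_m\mathbf{b}_m$, where $\mathbf{b}_m$ is the least-squares error BLP. Indeed, if $\beta_{m+1} \triangleq \tilde{\mathbf{e}}_m^{bH}\tilde{\mathbf{e}}_m$, we can show that $k_{m+1} = \beta_{m+1}/\xi_{m+1}$ (see Problem 7.11).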


EXAMPLE 7.5.1 Solve the LS problem with the following data matrix and desired response signal: 
111 1 


2 
ya 
3 
1 


or N 
= We 


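Solution. We start by computing the time-average correlation matrix and cross-correlation vector

$$\hat{\mathbf R} = \begin{bmatrix} 15 & 8 & 13 \\ 8 & 6 & 6 \\ 13 & 6 & 12 \end{bmatrix} \qquad \hat{\mathbf d} = \begin{bmatrix} 20 \\ 9 \\ 18 \end{bmatrix}$$

followed by the $LDL^H$ decomposition of $\hat{\mathbf R}$ using the MATLAB function [L,D]=ldlt(X). This gives

$$\mathbf L = \begin{bmatrix} 1 & 0 & 0 \\ 0.5333 & 1 & 0 \\ 0.8667 & -0.5385 & 1 \end{bmatrix} \qquad \mathbf D = \begin{bmatrix} 15 & 0 & 0 \\ 0 & 1.7333 & 0 \\ 0 & 0 & 0.2308 \end{bmatrix}$$

and working through the steps in Table 7.1, we find the LS solution and LSE to be

$$\mathbf c_{ls} = [3.0\;\; -1.5\;\; -1.0]^T \qquad E_{ls} = 1.5$$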


using the following sequence of MATLAB commands

k=D\(L\dhat);

cls=L'\k;

Els=sum((y-X*cls).^2);
These results can be verified by using the command cls=Rhat\dhat.
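
The same computation can be sketched in plain Python for real-valued data. This is an illustrative sketch rather than the book's toolbox: the helper names `ldlt` and `solve_ls` are hypothetical, and the data matrix X and response y used below are an assumption chosen to reproduce the correlation matrix and cross-correlation vector printed in the example.

```python
def ldlt(R):
    """Return unit lower-triangular L and the diagonal of D with R = L D L^T (real R)."""
    M = len(R)
    L = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    D = [0.0] * M
    for j in range(M):
        D[j] = R[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, M):
            L[i][j] = (R[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D

def solve_ls(X, y):
    """Solve the normal equations through the LDL^T factors; return (c, E, L, D)."""
    N, M = len(X), len(X[0])
    # Time-average correlation matrix R = X^T X and cross-correlation d = X^T y
    R = [[sum(X[n][i] * X[n][j] for n in range(N)) for j in range(M)] for i in range(M)]
    d = [sum(X[n][i] * y[n] for n in range(N)) for i in range(M)]
    L, D = ldlt(R)
    # Forward substitution L z = d, then k = D^{-1} z
    z = [0.0] * M
    for i in range(M):
        z[i] = d[i] - sum(L[i][j] * z[j] for j in range(i))
    k = [z[i] / D[i] for i in range(M)]
    # Back substitution L^T c = k
    c = [0.0] * M
    for i in reversed(range(M)):
        c[i] = k[i] - sum(L[j][i] * c[j] for j in range(i + 1, M))
    # Total squared error of the LS estimate
    E = sum((y[n] - sum(X[n][i] * c[i] for i in range(M))) ** 2 for n in range(N))
    return c, E, L, D

# Assumed data, consistent with the printed correlation matrix and vector.
X = [[1, 1, 1], [2, 2, 1], [3, 1, 3], [1, 0, 1]]
y = [1, 2, 4, 3]
c, E, L, D = solve_ls(X, y)
print(c, E)
```

Note that the intermediate vector k requires the division by the diagonal of D; skipping that step and back-substituting directly would solve a different system.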


7.5.2 LSE FIR Filtering and Prediction 


As we stressed in Section 6.3, the fundamental difference between general linear estimation and FIR filtering and 
prediction, which is the key to the development of efficient order-recursive algorithms, is the shift invariance of the 
input data vector 


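$$\mathbf x_{m+1}(n) = [x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-m+1)\;\; x(n-m)]^T$$ (7.5.13)

The input data vector can be partitioned as

$$\mathbf x_{m+1}(n) = \begin{bmatrix} \mathbf x_m(n) \\ x(n-m) \end{bmatrix} = \begin{bmatrix} x(n) \\ \mathbf x_m(n-1) \end{bmatrix}$$ (7.5.14)

which shows that samples from different times are incorporated as the order is increased. This creates a coupling
between order and time updatings that has significant implications in the development of efficient algorithms. Indeed,
we can easily see that the matrix

$$\hat{\mathbf R}_{m+1} = \sum_{n=N_i}^{N_f}\mathbf x_{m+1}(n)\mathbf x_{m+1}^H(n)$$ (7.5.15)

can be partitioned as

$$\hat{\mathbf R}_{m+1} = \begin{bmatrix} \hat{\mathbf R}_m & \hat{\mathbf r}_m^b \\ \hat{\mathbf r}_m^{bH} & E_m^b \end{bmatrix} = \begin{bmatrix} E_m^f & \hat{\mathbf r}_m^{fH} \\ \hat{\mathbf r}_m^f & \hat{\mathbf R}_m^f \end{bmatrix}$$ (7.5.16)

where

$$\hat{\mathbf R}_m^f = \hat{\mathbf R}_m + \mathbf x_m(N_i-1)\mathbf x_m^H(N_i-1) - \mathbf x_m(N_f)\mathbf x_m^H(N_f)$$ (7.5.17)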


is the matrix equivalent of (7.2.28). We notice that the relationship between $\hat{\mathbf R}_m^f$ and $\hat{\mathbf R}_m$, which allows for the
development of a complete set of order-recursive algorithms for FIR filtering and prediction, depends on the choice
of $N_i$ and $N_f$, that is, the windowing method selected.
As we discussed in Section 7.3, there are four cases of interest. In the full-windowing case ($N_i = 0$,
$N_f = N + M - 2$), we have $\hat{\mathbf R}_m^f = \hat{\mathbf R}_m$ and $\hat{\mathbf R}_m$ is Toeplitz. Therefore, all the algorithms and structures developed
in Chapter 6 for Toeplitz matrices can be utilized.
In the prewindowing case ($N_i = 0$, $N_f = N - 1$), Equation (7.5.17) becomes


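$$\hat{\mathbf R}_m^f = \hat{\mathbf R}_m - \mathbf x_m(N-1)\mathbf x_m^H(N-1)$$ (7.5.18)

Since $\mathbf x_m(n) = \mathbf 0$ for $n < 0$ (prewindowing), $\hat{\mathbf R}_m$ is a function of N. If we use the definition

$$\hat{\mathbf R}_m(N) \triangleq \sum_{n=0}^{N-1}\mathbf x_m(n)\mathbf x_m^H(n)$$ (7.5.19)

then the time-updating (7.5.18) can be written as

$$\hat{\mathbf R}_m^f(N) = \hat{\mathbf R}_m(N-1) = \hat{\mathbf R}_m(N) - \mathbf x_m(N-1)\mathbf x_m^H(N-1)$$ (7.5.20)

and the order-updating (7.5.16) as

$$\hat{\mathbf R}_{m+1}(N) = \begin{bmatrix} \hat{\mathbf R}_m(N) & \hat{\mathbf r}_m^b(N) \\ \hat{\mathbf r}_m^{bH}(N) & E_m^b(N) \end{bmatrix} = \begin{bmatrix} E_m^f(N) & \hat{\mathbf r}_m^{fH}(N) \\ \hat{\mathbf r}_m^f(N) & \hat{\mathbf R}_m(N-1) \end{bmatrix}$$ (7.5.21)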
which has the same form as (6.3.3). Therefore, all order recursions developed in Section 6.3 can be applied in the 
prewindowing case. However, to get a complete algorithm, we need recursions for the time updatings of the BLP 
$\mathbf b_m(N-1) \to \mathbf b_m(N)$ and $E_m^b(N-1) \to E_m^b(N)$, which can be developed by using the time-recursive algorithms
developed in Chapter 9 for LS adaptive filters. The postwindowing case can be developed in a similar fashion, but it 
is of no particular practical interest. 


In the no-windowing case ($N_i = M - 1$, $N_f = N - 1$), matrices $\hat{\mathbf R}_m$ and $\hat{\mathbf R}_m^f$ depend on both M and N. Thus,
although the development of order recursions can be done as in the prewindowing case, the time updatings are more
complicated due to (7.5.17) (Morf et al. 1977). Setting the lower limit to $N_i = M - 1$ means that all filters $\mathbf c_m$,
$1 \le m \le M$, are optimized over the interval $M - 1 \le n \le N - 1$, which makes the optimum nesting property
possible. If we set $N_i = m - 1$, each filter $\mathbf c_m$ is optimized over the interval $m - 1 \le n \le N - 1$; that is, it utilizes
all the available data. However, in this case, the optimum nesting property $\hat{\mathbf R}_m = \hat{\mathbf R}_{m+1}^{\lceil m \rceil}$ does not hold, and the
resulting order-recursive algorithms are slightly more complicated (Kalouptsidis et al. 1984).

The development of order-recursive algorithms for FBLP least-squares filters and predictors with linear phase
constraints, for example, $\mathbf c_m = \pm\mathbf J\mathbf c_m^*$, is more complicated, in general. A review of existing algorithms and more
references can be found in Theodoridis and Kalouptsidis (1993).

In conclusion, we notice that order-recursive algorithms are more efficient than the $LDL^H$ decomposition-based
solutions only if N is much larger than M. Furthermore, their numerical properties are inferior to those of the $LDL^H$
decomposition methods; therefore, extra caution needs to be exercised when order-recursive algorithms are
employed.


7.6 Summary 


In this chapter we discussed the theory, implementation, and application of linear estimators (combiners, filters, and 
predictors) that are optimum according to the LSE criterion of performance. The fundamental differences between 
linear MMSE and LSE estimators are as follows: 

• MMSE estimators are designed using ensemble average second-order moments R and d; they can be designed
prior to operation, and during their normal operation they need only the input signals.

• LSE estimators are designed using time-average estimates $\hat{\mathbf R}$ and $\hat{\mathbf d}$ of the second-order moments or data
matrix X and the desired response vector y. For this reason LSE estimators are sometimes said to be
data-adaptive. The design and operation of LSE estimators are coupled and are usually accomplished by
using either of the following approaches:

- Collect a block of training data $\mathbf X_{tr}$ and $\mathbf y_{tr}$ and use them to design an LSE estimator; use it to
process subsequent blocks. Clearly, this approach is meaningful if all blocks have statistically similar
characteristics.
- For each collected block of data X and y, compute the LSE filter $\mathbf c_{ls}$ or the LSE estimate $\hat{\mathbf y}$
(whatever is needed).

There are various numerical algorithms designed to compute LSE estimators and estimates. For well-behaved
data and sufficient numerical precision, all these methods produce the same results and therefore provide the same
LSE performance, that is, the same total squared error.

However, when ill-conditioned data, finite precision, or computational complexity is a concern, the choice of the 
LS computational algorithm is very important. 

In conclusion, we emphasize that there are various ways to compute the coefficients of an optimum estimator 
and the value of the optimum estimate. We stress that the performance of any optimum estimator, as measured by the 
MMSE or LSE, does not depend on the particular implementation as long as we have sufficient numerical precision. 
Therefore, if we want to investigate how well an optimum estimator performs in a certain application, we can use any 
implementation, as long as computational complexity is not a consideration. 


Problems 

7.1 By differentiating (7.2.8) with respect to the vector c, show that the LSE estimator $\mathbf c_{ls}$ is given by the solution of the normal
equations (7.2.12).

7.2 Let the weighted LSE be given by $E_w = \mathbf e^H\mathbf W\mathbf e$, where W is a Hermitian positive definite matrix.

(a) By minimizing $E_w$ with respect to the vector c, show that the weighted LSE estimator is given by (7.2.35).
(b) Using the $LDL^H$ decomposition $\mathbf W = \mathbf L\mathbf D\mathbf L^H$, show that the weighted LS criterion corresponds to prefiltering the error or the
data.

7.3 Using direct substitution of (7.4.4) into (7.4.5), show that the LS estimator $\mathbf c_{ls}^{(i)}$ and the associated LS error $E_{ls}^{(i)}$ are determined
by (7.4.5).

7.4 Consider a linear system described by the difference equation $y(n) = 0.9y(n-1) + 0.1x(n-1) + v(n)$, where x(n) is the input
signal, y(n) is the output signal, and v(n) is an output disturbance. Suppose that we have collected N = 1000 samples of
input-output data and that we wish to estimate the system coefficients, using the LS criterion with no windowing. Determine the
coefficients of the model $y(n) = ay(n-1) + dx(n-1)$ and their estimated covariance matrix $\hat\sigma_e^2\hat{\mathbf R}^{-1}$ when
(a) $x(n) \sim \mathrm{WGN}(0,1)$ and $v(n) \sim \mathrm{WGN}(0,1)$ and
(b) $x(n) \sim \mathrm{WGN}(0,1)$ and $v(n) = 0.8v(n-1) + w(n)$ is an AR(1) process with $w(n) \sim \mathrm{WGN}(0,1)$. Comment upon the
quality of the obtained estimates by comparing the matrices $\hat\sigma_e^2\hat{\mathbf R}^{-1}$ obtained in each case.

7.5 Use Lagrange multipliers to show that Equation (7.4.13) provides the minimum of (7.4.8) under the constraint (7.4.9). 

7.6 If full windowing is used in LS, then the autocorrelation matrix is Toeplitz. Using this fact, show that in the combined FBLP the
predictor is given by

$$\mathbf a^{fb} = \tfrac{1}{2}\left(\mathbf a + \mathbf J\mathbf b^*\right)$$

7.7 Consider the noncausal “middle” sample linear signal estimator specified by (7.4.1) with M = 2L and i = L.

(a) Show that if we apply full windowing to the data matrix, the resulting signal estimator is conjugate symmetric, that is,
$\mathbf c = \mathbf J\mathbf c^*$. This property does not hold for any other windowing method.
(b) Derive the normal equations for the signal estimator that minimizes the total squared error $E^{(i)} = \|\mathbf e^{(i)}\|^2$ under the constraint
$\mathbf c^{(i)} = \mathbf J\mathbf c^{(i)*}$.
(c) Show that if we enforce the normal equation matrix to be centro-Hermitian, that is, we use the normal equations

$$\left(\mathbf X^H\mathbf X + \mathbf J\mathbf X^T\mathbf X^*\mathbf J\right)\mathbf c^{(i)} = \begin{bmatrix} \mathbf 0 \\ E^{(i)} \\ \mathbf 0 \end{bmatrix}$$

then the resulting signal smoother is conjugate symmetric.
(d) Illustrate parts (a) to (c), using the data matrix 


1 
2 
X =|3 
1 
1 


Nv oF p m 
= m Doe = 


and check which smoother provides the smallest total squared error. Try to justify the obtained answer. 
7.8 A useful impulse response for some geophysical signal processing applications is the Mexican hat wavelet

$$g(t) = (1 - t^2)\,e^{-t^2/2}$$

which is the second derivative of a Gaussian pulse.

(a) Plot the wavelet g(t) and the magnitude and phase of its Fourier transform.

(b) By examining the spectrum of the wavelet, determine a reasonable sampling frequency $F_s$.

(c) Design an optimum LS inverse FIR filter for the discrete-time wavelet g(nT), where $T = 1/F_s$. Determine a reasonable
value for M by plotting the LSE $E_M$ as a function of order M. Investigate whether we can improve the inverse filter by
introducing some delay $n_0$. Determine the best value of $n_0$ and plot the impulse response of the resulting filter and the
combined impulse response $g(n) * h(n - n_0)$, which should resemble an impulse.

(d) Repeat part (c) by increasing the sampling frequency by a factor of 2 and comparing with the results obtained in part (c). 

7.9 (a) Prove Equation (7.5.4) regarding the $LDL^H$ decomposition of the augmented matrix $\bar{\mathbf R}$.

(b) Solve the LS estimation problem in Example 7.5.1, using the $LDL^H$ decomposition of $\bar{\mathbf R}$ and the partitionings in (7.5.4).

7.10 Prove the order-recursive algorithm described by the relations given in (7.5.12). Demonstrate the validity of this approach, using 
the data in Example 7.5.1. 

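7.11 In this problem, we wish to show that the statistical interpretations of innovation and partial correlation for $w_m(n)$ and $k_{m+1}$ in
(7.5.12) hold in a deterministic LSE sense. To this end, suppose that the “partial correlation” between $\tilde{\mathbf y}$ and $\tilde{\mathbf x}_{m+1}$ is defined
using the residual records $\tilde{\mathbf e}_m = \tilde{\mathbf y} - \mathbf X_m\mathbf c_m$ and $\tilde{\mathbf e}_m^b = \tilde{\mathbf x}_{m+1} + \mathbf X_m\mathbf b_m$, where $\mathbf b_m$ is the LSE BLP. Show that $k_{m+1} = \beta_{m+1}/\xi_{m+1}$,
where $\beta_{m+1} \triangleq \tilde{\mathbf e}_m^{bH}\tilde{\mathbf e}_m$ and $\xi_{m+1} = \tilde{\mathbf e}_m^{bH}\tilde{\mathbf e}_m^b$. Demonstrate the validity of these formulas using the data in Example 7.5.1.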


7.12 Show that the Cholesky decomposition of a Hermitian positive definite matrix R can be computed by using the following
algorithm

for j = 1 to M
    $l_{jj} = \left(r_{jj} - \sum_{k=1}^{j-1}|l_{jk}|^2\right)^{1/2}$
    for i = j+1 to M
        $l_{ij} = \left(r_{ij} - \sum_{k=1}^{j-1} l_{ik}\,l_{jk}^*\right)\big/\,l_{jj}$
    end i
end j

and write a MATLAB function for its implementation. Test your code using the built-in MATLAB function chol.

7.13 In this problem we examine in greater detail the radio-frequency interference cancelation experiment discussed in Section 7.4.3. 
We first explain the generation of the various signals and then proceed with the design and evaluation of the LS interference 
canceler. 

(a) The useful signal is a pointlike target defined by

$$s(t) = \frac{d}{dt}\left(\frac{1}{e^{-\alpha t/t_r} + e^{\alpha t/t_f}}\right) \triangleq \frac{dg(t)}{dt}$$


where $\alpha = 2.3$, $t_r = 0.4$, and $t_f = 2$. Given that $F_s = 2$ GHz, determine s(n) by computing the samples g(n) = g(nT) in the
interval $-2 \le nT \le 6$ ns and then computing the first difference s(n) = g(n) − g(n−1). Plot the signal s(n) and its
Fourier transform (magnitude and phase), and check whether the pointlike and wideband assumptions are justified.

(b) Generate N = 4096 samples of the narrowband interference using the formula

$$z(n) = \sum_{i=1}^{L} A_i\sin(\omega_i n + \phi_i)$$

and the following information:

Fs=2; % All frequencies are measured in GHz.

F=0.1*[0.6 1 1.8 2.1 3 4.8 5.2 5.7 6.1 6.4 6.7 7 7.8 9.3];
L=length(F);

om=2*pi*F/Fs;

A=[0.5 1 1 0.5 0.1 0.3 0.5 1 1 1 0.5 0.3 1.5 0.5];

rand('seed',1954);

phi=2*pi*rand(L,1);


(c) Compute and plot the periodogram of z(n) to check the correctness of your code.
(d) Generate N samples of white Gaussian noise $v(n) \sim \mathrm{WGN}(0, 0.1)$ and create the observed signal $x(n) = 5s(n - n_0) + z(n) + v(n)$,
where $n_0 = 1000$. Compute and plot the periodogram of x(n).
(e) Design a one-step ahead (D = 1) linear predictor with M = 100 coefficients using the FBLP method with no windowing.
Then use the obtained FBLP to clean the corrupted signal x(n) as shown in Figure 7.7. To evaluate the performance of the
canceler, generate the plots shown in Figures 7.8 and 7.9.

7.14 Careful inspection of Figure 7.9 indicates that the D-step prediction error filter, that is, the system with input x(n) and output
e(n), acts as a whitening filter. In this problem, we try to solve Problem 7.13 by designing a practical whitening filter using a
power spectral density (PSD) estimate of the corrupted signal x(n).
(a) Estimate the PSD $\hat R_x(e^{j\omega_k})$, $\omega_k = 2\pi k/N_{FFT}$, of the signal x(n), using the method of averaged periodograms. Use a
segment length of L = 256 samples, 50 percent overlap, and $N_{FFT} = 512$.
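(b) Since the PSD does not provide any phase information, we shall design a whitening FIR filter with linear phase by

$$\tilde H(k) = \frac{1}{\sqrt{\hat R_x(e^{j\omega_k})}}\; e^{-j\frac{2\pi}{N_{FFT}}\frac{N_{FFT}-1}{2}k}$$

where $\tilde H(k)$ is the DFT of the impulse response of the filter, that is,

$$\tilde H(k) = \sum_{n=0}^{N_{FFT}-1} h(n)\,e^{-j\frac{2\pi}{N_{FFT}}nk}$$

with $0 \le k \le N_{FFT} - 1$.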

(c) Use the obtained whitening filter to clean the corrupted signal x(n), and compare its performance with the FBLP canceler by 
generating plots similar to those shown in Figures 7.8 and 7.9. 

(d) Repeat part (c) with L = 128, $N_{FFT} = 512$ and L = 512, $N_{FFT} = 1024$ and check whether spectral resolution has any
effect upon the performance. Note: Information about the design and implementation of FIR filters using the DFT can be found in
Proakis and Manolakis (1996).

7.15 Repeat Problem 7.14, using the multitaper method of PSD estimation.

7.16 In this problem we develop an RFI canceler using a symmetric linear smoother with guard samples defined by

$$e(n) = x(n) - \hat x(n) \triangleq x(n) + \sum_{k=D}^{M} c_k\,[x(n-k) + x(n+k)]$$

where $1 \le D < M$ prevents the use of the D adjacent samples to the estimation of x(n).
(a) Following the approach used in Section 7.4.3, demonstrate whether such a canceler can be used to mitigate RFI and under what 
conditions. 
(b) If there is theoretical justification for such a canceler, estimate its coefficients, using the method of LS with no windowing for 
M =50 and D=1 for the situation described in Problem 7.13. 
(c) Use the obtained filter to clean the corrupted signal x(n) , and compare its performance with the FBLP canceler by generating 
plots similar to those shown in Figures 7.8 and 7.9. 
(d) Repeat part (c) for D = 2.

7.17 In Example 5.7.1 we studied the design and performance of an optimum FIR inverse system. In this problem, we design and
analyze the performance of a similar FIR LS inverse filter, using training input-output data. 
(a) First, we generate N =100 observations of the input signal y(n) and the noisy output signal x(n). We assume that 
x(n) ~WGN(0,1) and v(n) ~WGN(0,0.1). To avoid transient effects, we generate 200 samples and retain the last 100 
samples to generate the required data records. 
(b) Design an LS inverse filter with M =10 for 0 < D < 10, using no windowing, and choose the best value of delay D. 
(c) Repeat part (b) using full windowing. 
(d) Compare the LS filters obtained in parts (b) and (c) with the optimum filter designed in Example 5.7.1. What are your 
conclusions? 
7.18 In this problem we estimate the equalizer discussed in Example 5.8.1, using input-output training data, and we evaluate its
performance using Monte Carlo simulation.
(a) Generate N = 1000 samples of input-desired response data $\{x(n), a(n)\}_{n=0}^{N-1}$ and use them to estimate the correlation
matrix $\hat{\mathbf R}_x$ and the cross-correlation vector $\hat{\mathbf d}$ between x(n) and y(n − D). Use D = 7, M = 11, and W = 2.9. Solve the
normal equations to determine the LS FIR equalizer and the corresponding LSE.
(b) Repeat part (a) 500 times; by changing the seed of the random number generators, compute the average (over the realizations)
coefficient vector and average LSE, and compare with the optimum MSE equalizer obtained in Example 5.8.1. What are your
conclusions?

(c) Repeat parts (a) and (b) by setting W = 3.1.

CHAPTER 8


Signal Modeling and Parametric 


Spectral Estimation 


This chapter is a transition from theory to practice. It focuses on the selection of an appropriate model for a given set 
of data, the estimation of the model parameters, and how well the model actually “fits the data.” Although the 
development of parameter estimation techniques requires a strong theoretical background, the selection of a good 
model and its subsequent evaluation require the user to have sufficient practical experience and a familiarity with the 
intended application. We provide complete, detailed algorithms for fitting pole-zero models to data using 
least-squares techniques. The estimation of all-pole model parameters involves the solution of a linear system of 
equations, whereas pole-zero modeling requires nonlinear least-squares optimization. The chapter is roughly 
organized into two separate but related parts. 

In the first part, we begin in Section 8.1 by explaining the steps that are required in the model-building process. 
Then, in Section 8.2, we introduce various least-squares algorithms for the estimation of parameters of direct and 
lattice all-pole models, provide different interpretations, and discuss some order selection criteria. For pole-zero 
models we provide, in Section 8.3, a nonlinear optimization algorithm that estimates the parameters of the model by 
minimizing the least-squares criterion. We conclude this part with Section 8.4 in which we discuss the applications of 
pole-zero models to spectral estimation and speech processing. 

In the second part, we begin with the method of minimum-variance spectral estimation (Capon’s method). Then 
we describe frequency estimation methods based on the harmonic model: the Pisarenko harmonic decomposition and 
the MUSIC, minimum-norm, and ESPRIT algorithms. These methods are suitable for applications in which the 
signals of interest can be represented by complex exponential or harmonic models. Signals consisting of complex 
exponentials are found in a variety of applications, including formant frequencies in speech processing, moving
targets in radar, and spatially propagating signals in array processing. 


8.1 The Modeling Process: Theory and Practice 


In this section, we discuss the modeling of real-world signals using parametric pole-zero (PZ) signal models, whose
theoretical properties were discussed in Chapter 3. We focus on PZ(P, Q) models with white input sequences, which
are also known as ARMA(P, Q) random signal models. These models are defined by the linear constant-coefficient
difference equation

$$x(n) = -\sum_{k=1}^{P} a_k\,x(n-k) + w(n) + \sum_{k=1}^{Q} d_k\,w(n-k)$$ (8.1.1)

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$ with $\sigma_w^2 < \infty$. The power spectral density (PSD) of the output signal is

$$R_x(e^{j\omega}) = \sigma_w^2\,\frac{\left|1 + \sum_{k=1}^{Q} d_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P} a_k e^{-j\omega k}\right|^2} = \sigma_w^2\,\frac{|D(e^{j\omega})|^2}{|A(e^{j\omega})|^2}$$ (8.1.2)

which is a rational function completely specified by the parameters $\{a_1, a_2, \ldots, a_P\}$, $\{d_1, \ldots, d_Q\}$, and $\sigma_w^2$. We
stress that since these models are linear and time-invariant (LTI), the resulting process x(n) is stationary, provided that
the corresponding systems are BIBO stable.

The essence of signal modeling and of the resulting parametric spectrum estimation is the following: Given
finite-length data $\{x(n)\}_{n=0}^{N-1}$, which can be regarded as a sample sequence of the signal under consideration, we
want to estimate signal model parameters $\{\hat a_k\}_1^P$, $\{\hat d_k\}_1^Q$, and $\hat\sigma_w^2$ to satisfy a prescribed criterion. Furthermore, if the
parameter estimates are sufficiently accurate, then the following formula

$$\hat R_x(e^{j\omega}) = \hat\sigma_w^2\,\frac{\left|1 + \sum_{k=1}^{Q}\hat d_k e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{P}\hat a_k e^{-j\omega k}\right|^2} = \hat\sigma_w^2\,\frac{|\hat D(e^{j\omega})|^2}{|\hat A(e^{j\omega})|^2}$$ (8.1.3)

should provide a reasonable estimate of the signal PSD. A similar argument applies to harmonic signal models and
harmonic spectrum estimation in which the model parameters are the amplitudes and frequencies of complex
exponentials (see Section 2.1.6).
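
As a small illustration of how (8.1.3) is used once parameter estimates are available, the following Python sketch evaluates the rational PSD on a uniform frequency grid. The function name `arma_psd` and the numerical values are assumptions made for illustration; real-valued coefficients are assumed.

```python
import cmath
import math

def arma_psd(a, d, var_w, num_freqs=8):
    """Evaluate var_w * |D(e^{jw})|^2 / |A(e^{jw})|^2 on [0, pi].

    a = [a_1, ..., a_P] and d = [d_1, ..., d_Q] follow the sign convention of (8.1.1).
    Returns a list of (omega, psd) pairs.
    """
    out = []
    for i in range(num_freqs):
        w = math.pi * i / (num_freqs - 1)
        D = 1 + sum(dk * cmath.exp(-1j * w * (k + 1)) for k, dk in enumerate(d))
        A = 1 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
        out.append((w, var_w * abs(D) ** 2 / abs(A) ** 2))
    return out

# Hypothetical AR(1) estimate: x(n) = 0.9 x(n-1) + w(n), i.e., a_1 = -0.9.
for omega, value in arma_psd(a=[-0.9], d=[], var_w=1.0, num_freqs=3):
    print(round(omega, 4), round(value, 4))
```

For this lowpass example the PSD is largest at zero frequency and decays toward $\omega = \pi$, as expected for a pole near z = 1.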

The development of such models involves the steps shown in Figure 8.1. In this chapter, we assume that we have
removed trends, seasonal variations, and other nonstationarities from the data. We further assume that unit poles have 
been removed from the data by using the differencing approach discussed in Box et al. (1994). 


Model selection 

In this step, we basically select the structure of the model (direct or lattice), and we make a preliminary decision 
on the orders P and Q of the model. The most important aid to model selection is the insight and understanding of the 
signal and the physical mechanism that generates it. Hence, in some applications (e.g., speech processing) physical 
considerations point to the type and order of the model; when we lack a priori information or we have insufficient 
knowledge of the mechanism generating the signal, we resort to data analysis methods. 

In general, to select a candidate model, we estimate the autocorrelation, partial autocorrelation, and power 
spectrum from the available data, and we compare them to the corresponding quantities obtained from the theoretical 
models (see Table 3.1). This preliminary data analysis provides sufficient information to choose a PZ model and 
some initial estimate for P and Q to start a model building process. Several order selection criteria have been 
developed that penalize both model misfit and a large number of parameters. Although theoretically interesting and 
appealing, these criteria are of limited value when we deal with actual signals.

FIGURE 8.1
Steps in the signal model building process.


The model structure influences (1) the complexity of the algorithm that estimates the model parameters and (2)
the shape of the criterion function (quadratic or nonquadratic). Because the structure (direct or lattice) affects only these
computational aspects, it is not critical to the performance of the model, and its choice is not as crucial as the choice of
the order of the model.


Model estimation 

In this step, also known as model fitting, we use the available data $\{x(n)\}_{n=0}^{N-1}$ to estimate the parameters of the
selected model, using optimization of some criterion. Although there are several criteria (e.g., maximum likelihood,
spectral matching) that can be used to measure the performance or quality of a PZ model, we concentrate on the
least-squares (LS) error criterion. As we shall see, the estimation of all-pole (AP) models leads to linear optimization
problems whereas the estimation of all-zero (AZ) and PZ models requires the solution of nonlinear optimization
problems. Parameter estimation for PZ models using other criteria can be found in Kay (1988), Box et al. (1994),
Porat (1994), and Ljung (1987).


Model validation 

Here we investigate how well the obtained model captures the key features of the data. We then take corrective 
actions, if necessary, by modifying the order of the model, and repeat the process until we get an acceptable model. 
The goal of the model validation process is to find out whether the model 


• Agrees sufficiently with the observed data
• Describes the “true” signal generation system
• Solves the problem that initiated the design process


Of course, the ultimate test is whether the model satisfies the requirements of the intended application, that is, the 
objective and subjective criteria that specify the performance of the model, computational complexity, cost, etc. In 
this discussion, we concentrate on how well the model fits the observed data in an LS error statistical sense. 

The existence of any structure in the residual or prediction error signal indicates a misfit between the model and 
the data. Hence, a key validation technique is to check whether the residual process, which is generated by the inverse 
of the fitted model, is a realization of white noise. This can be checked by using, among others, the following 
statistical techniques (Brockwell and Davis 1991; Bendat and Piersol 1986): 


Autocorrelation test. It can be shown (Kendall and Stuart 1983) that when N is sufficiently large, the
distribution of the estimated autocorrelation coefficients $\hat\rho(l) = \hat r(l)/\hat r(0)$ is approximately Gaussian with zero
mean and variance of 1/N. The approximate 95 percent confidence limits are $\pm 1.96/\sqrt{N}$. Any estimated values of
$\hat\rho(l)$ that fall outside these limits are “significantly” different from zero with 95 percent confidence. Values well
beyond these limits indicate nonwhiteness of the residual signal.
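
The autocorrelation test can be sketched in a few lines of Python. This is a minimal illustration, not the book's `autoc` routine: the function name `autocorrelation_test` is hypothetical, and removing the sample mean and using the biased (1/N) autocorrelation estimate are assumptions of this sketch.

```python
import math

def autocorrelation_test(x, max_lag=20):
    """Return (rho, limit, outside) for a candidate residual record x.

    rho[l-1] is the estimated autocorrelation coefficient at lag l,
    limit is 1.96/sqrt(N), and outside lists the lags with |rho| > limit.
    """
    N = len(x)
    mean = sum(x) / N                      # assumption: remove the sample mean first
    xc = [v - mean for v in x]
    r0 = sum(v * v for v in xc) / N        # biased estimate of r(0)
    rho = []
    for l in range(1, max_lag + 1):
        r_l = sum(xc[n] * xc[n - l] for n in range(l, N)) / N
        rho.append(r_l / r0)
    limit = 1.96 / math.sqrt(N)
    outside = [l for l, p in enumerate(rho, start=1) if abs(p) > limit]
    return rho, limit, outside

# A strongly structured "residual": an alternating +1/-1 record is far from white.
x = [(-1.0) ** n for n in range(100)]
rho, limit, outside = autocorrelation_test(x, max_lag=5)
print(limit, outside)
```

For a record that is actually white, roughly one lag in twenty is still expected to fall outside the limits by chance, so isolated marginal exceedances should not be overinterpreted.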


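Power spectrum density test. Given a set of data $\{x(n)\}_{n=0}^{N-1}$, the standardized cumulative periodogram is
defined by

$$\tilde I(k) \triangleq \begin{cases} 0 & k < 1 \\[4pt] \dfrac{\sum_{i=1}^{k}\hat R_x(e^{j2\pi i/N})}{\sum_{i=1}^{K}\hat R_x(e^{j2\pi i/N})} & 1 \le k \le K \\[4pt] 1 & k > K \end{cases}$$ (8.1.4)

where $\hat R_x(e^{j2\pi i/N})$ is the periodogram and K is the integer part of N/2. If the process x(n) is white Gaussian noise (WGN), then the random
variables $\tilde I(k)$, $k = 1, 2, \ldots, K$, are independently and uniformly distributed in the interval (0,1), and the plot
of $\tilde I(k)$ should be approximately linear with respect to k (Jenkins and Watts 1968). The hypothesis is rejected at
level 0.05 if $\tilde I(k)$ exits the boundaries specified by

$$\tilde I^{(b)}(k) = \frac{k-1}{K-1} \pm 1.36\,(K-1)^{-1/2} \qquad 1 \le k \le K$$ (8.1.5)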
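
A minimal Python sketch of the cumulative periodogram test follows. It assumes real data, computes the periodogram ordinates with a direct (slow) DFT to stay self-contained, and uses boundaries of the form $(k-1)/(K-1) \pm 1.36\,(K-1)^{-1/2}$; the function names are hypothetical.

```python
import cmath
import math

def cumulative_periodogram(x):
    """Standardized cumulative periodogram I(k), k = 1..K, with K = floor(N/2)."""
    N = len(x)
    K = N // 2
    P = []
    for i in range(1, K + 1):
        # Periodogram ordinate at omega_i = 2*pi*i/N (direct DFT, O(N^2) overall)
        X = sum(x[n] * cmath.exp(-2j * math.pi * i * n / N) for n in range(N))
        P.append(abs(X) ** 2 / N)
    total = sum(P)
    I, acc = [], 0.0
    for p in P:
        acc += p
        I.append(acc / total)
    return I

def passes_cumulative_periodogram_test(x):
    """True if I(k) stays inside the 5 percent boundaries for all k."""
    I = cumulative_periodogram(x)
    K = len(I)
    half_width = 1.36 / math.sqrt(K - 1)
    return all(abs(I[k - 1] - (k - 1) / (K - 1)) <= half_width
               for k in range(1, K + 1))

# A record with a perfectly flat periodogram (unit impulse) stays inside the limits,
# whereas a pure sinusoid puts all its power in one bin and is rejected.
flat = [1.0] + [0.0] * 99
tone = [math.cos(2 * math.pi * 5 * n / 100) for n in range(100)]
print(passes_cumulative_periodogram_test(flat),
      passes_cumulative_periodogram_test(tone))
```

In practice an FFT routine would replace the direct DFT; the direct form is used here only to keep the sketch free of external dependencies.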


Partial autocorrelation test. This test is similar to the autocorrelation test. Given the residual process x(n), it
can be shown (Kendall and Stuart 1983) that when N is sufficiently large, the partial autocorrelation sequence (PACS)
values $\{k_l\}$ for lag l [defined in (3.2.44)] are approximately independent with distribution WN(0, 1/N). This
means that roughly 95 percent of the PACS values fall within the bounds $\pm 1.96/\sqrt{N}$. If we observe values
consistently well beyond this range for N sufficiently large, it may indicate nonwhiteness of the signal.
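
The PACS values can be obtained from autocorrelation estimates with the Levinson-Durbin recursion, as in the following Python sketch for real data. The function names are hypothetical, and the sign of $k_l$ follows the predictor convention $x(n) = -\sum_k a_k x(n-k) + e(n)$; whether this sign matches (3.2.44) is an assumption, but the test itself depends only on $|k_l|$.

```python
import math

def pacs_levinson(r, max_lag):
    """Levinson-Durbin recursion on r[0..max_lag]; returns [k_1, ..., k_max_lag]."""
    a = []          # predictor coefficients a_1..a_{m-1} of the previous order
    E = r[0]        # prediction error energy of the previous order
    ks = []
    for m in range(1, max_lag + 1):
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = -acc / E
        # Order update: a_i^(m) = a_i^(m-1) + k * a_{m-i}^(m-1), and a_m^(m) = k
        a = [a[i] + k * a[m - 2 - i] for i in range(m - 1)] + [k]
        E *= (1.0 - k * k)
        ks.append(k)
    return ks

def pacs_outside_limits(ks, N):
    """Lags whose PACS magnitude exceeds 1.96/sqrt(N)."""
    limit = 1.96 / math.sqrt(N)
    return [l for l, k in enumerate(ks, start=1) if abs(k) > limit]

# Autocorrelation of an AR(1) process with coefficient 0.8: r(l) = 0.8**l.
r = [0.8 ** l for l in range(6)]
ks = pacs_levinson(r, 5)
flagged = pacs_outside_limits(ks, N=100)
print(ks, flagged)
```

For this AR(1) autocorrelation only the first PACS value is nonzero, so only lag 1 is flagged; a white residual would ideally produce no flagged lags beyond the occasional chance exceedance.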


EXAMPLE 8.1.1. To apply the above tests and interpret their results, we consider a WGN sequence x(n). By using the randn
function, 100 samples of x(n) with zero mean and unit variance were generated. These samples are shown in Figure 8.2. From
these samples, the autocorrelation estimates up to lag 40, denoted by $\{\hat r(l)\}_{l=0}^{40}$, were computed using the autoc function, from
which the correlation coefficients $\hat\rho(l)$ were obtained. The first 10 coefficients are shown in Figure 8.2 along with the
appropriate confidence limits. As expected, the first coefficient at lag 0 is unity while the remaining coefficients are within the
confidence limits.

Next, using the psd function, a periodogram based on 100 samples was computed, from which the cumulative periodogram
$\tilde I(k)$ was obtained and plotted as a function of the normalized frequency, as shown in Figure 8.2. The confidence limits are also
shown. The computed cumulative periodogram is a monotonic increasing function lying within the limits.

Finally, using the durbin function, the PACS sequence $\{k_l\}_{l=1}^{40}$ was computed from the estimated correlations and plotted in
Figure 8.2. Again all the values for lags $l \ge 1$ are within the confidence limits. Thus all three tests suggest that the 100-point data
are almost surely from a white noise sequence.


Although the whiteness of the residuals is a good test for model fitting, it does not provide a definite answer to 
the problem. Some additional procedures include checking whether 


• The criterion of performance decreases (fast enough) as we increase the order of the model.

• The estimate of the variance of the residual decreases as the number N of observations increases.

• Some estimated parameters that have physical meaning (e.g., reflection coefficients) assume values that make
sense.

• The estimated parameters have sufficient accuracy for the intended application.

FIGURE 8.2
Validation tests on white Gaussian noise in Example 8.1.1.


Finally, to demonstrate that the model is sufficiently accurate for the purpose for which it was designed, we can 
use a method known as cross-validation. Basically, in cross-validation we use one set of data to fit the model and 
another, statistically independent set of data to test it. Cross-validation is of paramount importance when we build 
models for control, forecasting, and pattern recognition (Ljung 1987). However, in signal processing applications, 
such as spectral estimation and signal compression, where the goal is to provide a good fit of the model to the 
analyzed data, cross-validation is not as useful. 


8.2 Estimation of All-Pole Models 


We next use the principle of least squares to estimate parameters of all-pole signal models assuming both white and 
periodic excitations. We also discuss criteria for model order selection, techniques for estimation of all-pole lattice 
parameters, and the relationship between all-pole estimation methods using the methods of least squares and 
maximum entropy. The relationship between all-pole model estimation and minimum-variance spectral estimation is 
explored in Section 8.5. 


8.2.1 Direct Structures 
Consider the AR($P_0$) model

$$x(n) = -\sum_{k=1}^{P_0} a_k^*\,x(n-k) + w(n)$$ (8.2.1)

where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. The Pth-order linear predictor of x(n) is given by

$$\hat x(n) = -\sum_{k=1}^{P}\hat a_k^*\,x(n-k)$$ (8.2.2)

and the corresponding prediction error sequence is

$$e(n) = x(n) - \hat x(n) = x(n) + \sum_{k=1}^{P}\hat a_k^*\,x(n-k)$$ (8.2.3)

$$= \hat{\mathbf a}^H\mathbf x(n)$$ (8.2.4)

where $\hat a_0 = 1$ and

$$\hat{\mathbf a} = [1\;\; \hat a_1\;\; \cdots\;\; \hat a_P]^T$$ (8.2.5)

$$\mathbf x(n) = [x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-P)]^T$$ (8.2.6)

Thus the error over the range $N_i \le n \le N_f$ can be expressed as a vector

$$\mathbf e = \mathbf X\hat{\mathbf a}$$ (8.2.7)

where X is the data matrix defined in (7.4.3). For the full-windowing case, the data matrix X is given by

$$\mathbf X^H = \begin{bmatrix} x(0) & x(1) & \cdots & x(P) & \cdots & 0 & \cdots & 0 \\ 0 & x(0) & \cdots & x(P-1) & \cdots & x(N-1) & \cdots & 0 \\ \vdots & \vdots & & \vdots & & \vdots & & \vdots \\ 0 & 0 & \cdots & x(0) & \cdots & x(N-P) & \cdots & x(N-1) \end{bmatrix}$$ (8.2.8)

while for the no-windowing case the data matrix X is

$$\mathbf X^H = \begin{bmatrix} x(P) & x(P+1) & \cdots & x(N-2) & x(N-1) \\ x(P-1) & x(P) & \cdots & x(N-3) & x(N-2) \\ \vdots & \vdots & & \vdots & \vdots \\ x(0) & x(1) & \cdots & x(N-P-2) & x(N-P-1) \end{bmatrix}$$ (8.2.9)

Notice that if $P = P_0$ and $\hat a_k = a_k$, the prediction error e(n) is identical to the white noise excitation w(n).
Furthermore, if AR($P_0$) is minimum-phase, then w(n) is the innovation process of x(n) and $\hat x(n)$ is the
MMSE prediction of x(n). Thus, we can obtain a good estimate of the model parameters by minimizing some
function of the prediction error.

In theory, we minimize the MSE $E\{|e(n)|^2\}$. In practice, since this is not possible, we estimate $\{\hat a_k\}_1^P$ for a
given P by minimizing the total squared error

$$E_P = \sum_{n=N_i}^{N_f}|e(n)|^2 = \sum_{n=N_i}^{N_f}\left|x(n) + \sum_{k=1}^{P}\hat a_k^*\,x(n-k)\right|^2$$ (8.2.10)

$$= \sum_{n=N_i}^{N_f}\left|\hat{\mathbf a}^H\mathbf x(n)\right|^2 = \hat{\mathbf a}^H\left(\mathbf X^H\mathbf X\right)\hat{\mathbf a}$$ (8.2.11)

over the range $N_i \le n \le N_f$. Hence, we can use the methods discussed in Section 7.4 for the computation of LS
linear predictors. In particular, the forward linear predictor coefficients $\{\hat a_k\}_{k=1}^{P}$ and the associated LS error $\hat E_P$ are
obtained by solving the normal equations

$$\left(\mathbf X^H\mathbf X\right)\hat{\mathbf a} = \begin{bmatrix} \hat E_P \\ \mathbf 0 \end{bmatrix}$$ (8.2.12)

The solution of (8.2.12) is discussed extensively in Chapter 7.
The least-squares AP(P) parameter estimates have properties similar to those of linear prediction. For example, 
if the process @(n) is Gaussian, the least-squares no-windowing estimates are also maximum-likelihood estimates 


(Jenkins and Watts 1968). The variance of the excitation process can be obtained from the LS error êp by 
1 NgPol 


ô = eln) f full windowing (8.2.13) 
Wop"? NaF oy het 


1 
n= 











N-I 
or ó= : see o Dle(n)P no windowing (8.2.14) 
for the full-windowing or no-windowing methods, respectively. Furthermore, in the full-windowing case, if the 
Toeplitz correlation matrix is positive definite, the obtained model is guaranteed to be minimum-phase (see Section 
6.4). MATLAB functions 
[ahat, e, V]=arwin(x, P)and[ahat, e, V]=arls (x, P) 

are provided that compute the model parameters, the error sequence, and the modeling error using the full-windowing 
and no-windowing methods, respectively. 
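
As a rough illustration of the no-windowing computation, the sketch below forms the normal equations (8.2.12) directly from real-valued data and solves them by Gaussian elimination. It is not the book's `arls` function: the names `solve_linear` and `ar_ls_no_window` are hypothetical, the data are assumed real, and the variance uses the no-windowing normalization 1/(N − P).

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("normal equations are numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / M[i][i]
    return z


def ar_ls_no_window(x, P):
    """No-windowing LS fit of e(n) = x(n) + sum_k a_k x(n-k), real data, P <= n <= N-1."""
    N = len(x)
    ns = range(P, N)
    # Lower-right P-by-P block of X^H X and its first column (lags 1..P).
    R = [[sum(x[n - i] * x[n - j] for n in ns) for j in range(1, P + 1)]
         for i in range(1, P + 1)]
    r = [sum(x[n] * x[n - i] for n in ns) for i in range(1, P + 1)]
    a = [1.0] + solve_linear(R, [-v for v in r])
    e = [sum(a[k] * x[n - k] for k in range(P + 1)) for n in ns]
    var = sum(v * v for v in e) / (N - P)
    return a, e, var
```

A quick sanity check is to feed the routine a noise-free sequence that satisfies a known second-order recursion; the fitted coefficients should then reproduce that recursion and the residual variance should be close to zero.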

We present three examples below to illustrate the all-pole model determination and its use in PSD estimation. 
The first example uses real data consisting of water-level measurements of Lake Huron from 1875 to 1972. The 
second example also uses real data containing sunspot numbers for 1770 through 1869. These sunspot numbers have 
an approximate cycle of period around 10 to 12 years. The Lake Huron and sunspot data are shown in Figure 8.3. The 
third example generates simulated AR(4) data to estimate model parameters and through them the PSD values. In
each case, the mean was computed and removed from the data prior to processing.

FIGURE 8.3
The Lake Huron and sunspot data used in Examples 8.2.1 and 8.2.2.


EXAMPLE 8.2.1. A careful examination of Lake Huron water-level measurement data indicates that a low-order all-pole model 
might be a suitable representation of the data. To test this hypothesis, first- and second-order models were considered. Using the 
full-windowing method, model parameters were computed: 


First-order: $\hat{a}_1 = -0.791$, $\hat{\sigma}_w^2 = 0.5024$
Second-order: $\hat{a}_1 = -1.002$, $\hat{a}_2 = 0.2832$, $\hat{\sigma}_w^2 = 0.4460$


Using these model parameters, the data were filtered and the residuals were computed. Three tests for checking the whiteness 
of the residuals as described in Section 8.1 were performed to ascertain the validity of models. In Figure 8.4, we show the residuals, 
the autocorrelation test, the PSD test, and the partial correlation test for the first-order model. The partial correlation test indicates 
that the PACS coefficient at lag 1 is outside the confidence limits and thus the first-order model is a poor fit. In Figure 8.5 we show 
the same plots for the second-order model. Clearly, these tests show that the residuals are approximately white. Therefore, the 
AR(2) model appears to be a good match to the data. 
FIGURE 8.4
Validation tests on the first-order model fit to the Lake Huron water-level measurement data in Example 8.2.1.

[Figure: four panels showing the residual samples versus n, the autocorrelation test versus lag l, the PSD test versus frequency (cycles/sample), and the partial autocorrelation test versus lag l.]

FIGURE 8.5
Validation tests on the second-order model fit to the Lake Huron water-level measurement data in Example 8.2.1.


EXAMPLE 8.2.2. Figure 8.6 shows the PACS coefficients of the sunspot numbers along with the 95 percent confidence limits. 
Since all PACS values beyond lag 2 fall well inside the limits, a second-order model is a possible candidate for the data. Therefore, 
the second-order model parameters were estimated from the data to obtain the model 


$$x(n) = 1.318\,x(n-1) - 0.634\,x(n-2) + w(n), \qquad \hat{\sigma}_w^2 = 289.2$$


In Figure 8.7 we show the residuals obtained by filtering the data along with three tests for its whiteness. The plots show that the 
estimated model is a reasonable fit to the data. Finally, in Figure 8.8 we show the PSD estimated from the AR(2) model as well as 
from the periodogram. The periodogram is very noisy and is devoid of any structure. The AR(2) spectrum is smoother and 
distinctly shows a peak near 0.1 cycle per sampling interval. Since the sampling interval is 1 year, the peak
corresponds to about 10 years per cycle, which agrees with the observations. Thus the parametric approach to PSD estimation was
appropriate. 
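
The location of the spectral peak can be checked directly from the fitted coefficients. The sketch below evaluates the AR(2) spectrum $\hat{\sigma}_w^2 / |A(e^{j2\pi f})|^2$ on a frequency grid and picks the maximum. The helper name `ar_psd` and the 2001-point grid are illustrative assumptions; only the coefficients and the variance come from the example.

```python
import cmath
import math


def ar_psd(a, var, freqs):
    """PSD var / |A(e^{j 2 pi f})|^2 for A(z) = sum_k a[k] z^{-k} with a[0] = 1."""
    psd = []
    for f in freqs:
        A = sum(ak * cmath.exp(-2j * math.pi * f * k) for k, ak in enumerate(a))
        psd.append(var / abs(A) ** 2)
    return psd


# x(n) = 1.318 x(n-1) - 0.634 x(n-2) + w(n)  ->  A(z) = 1 - 1.318 z^-1 + 0.634 z^-2
a = [1.0, -1.318, 0.634]
freqs = [0.5 * i / 2000 for i in range(2001)]   # 0 to 0.5 cycles/sample
psd = ar_psd(a, 289.2, freqs)
peak_freq = freqs[max(range(len(psd)), key=psd.__getitem__)]
```

With one sample per year, `1 / peak_freq` is the implied cycle length in years; for these coefficients it falls within the 10 to 12 year range quoted for the sunspot cycle.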
FIGURE 8.6
The PACS values of the sunspot numbers in Example 8.2.2.
FIGURE 8.7
Validation tests on the second-order model fit to the sunspot numbers in Example 8.2.2.
FIGURE 8.8
Comparison of the periodogram and the AR(2) spectrum in Example 8.2.2.


EXAMPLE 8.2.3. We illustrate the least-squares algorithms described above, using the AR(4) process x(n) introduced in 
Example 4.3.2. The system function of the model is given by 
$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$


and the excitation is a zero-mean Gaussian white noise with unit variance. Suppose that we are given the N = 250 samples of x(n) 
shown in Figure 8.9 and we wish to model the underlying process by using an all-pole model. To identify a candidate model, we 
compute the autocorrelation, partial autocorrelation, and periodogram, using the available data. Careful inspection of Figure 8.9 
and the signal model characteristics given in Table 3.1 suggests an AR model. Since the PACS plot cuts off around P=5 we 
choose P = 4 and fit an AR(4) model to the data, using both the full-windowing and no-windowing methods. Figure 8.10 shows the 
actual spectrum of the process, the spectra of the estimated models, and the periodogram. Clearly, the no-windowing estimate 
provides a better fit because it does not impose any windowing on the data. Figure 8.11 shows the residual, autocorrelation, partial 
autocorrelation, and periodogram for the no-windowing-based model. We see that the residuals can be assumed uncorrelated with 
reasonable confidence, which implies that the model captures the second-order statistics of the data. 
FIGURE 8.9
Data segment from an AR(4) process, and the corresponding autocorrelation, partial autocorrelation, and periodogram.
FIGURE 8.10
Periodogram, theoretical AR(4) spectrum, and AR(4) model spectra using full windowing, Hamming windowing, and no windowing.
FIGURE 8.11
Residual sequence for the AR(4) data, and the corresponding autocorrelation, partial autocorrelation, and periodogram.


Modified covariance method. The LS method described above to estimate model parameters uses the forward 
linear predictor and prediction error. There is also another approach that is based on the backward linear predictor. 
Recall that the backward linear predictor derived from the known correlations is the complex conjugate of the 
forward predictor (and likewise, the corresponding errors are identical). However, the LS estimators and errors based 
on the actual data are different because the data read in each direction are different from a statistical viewpoint. Hence, 
it is much more reasonable to consider both forward and backward predictors and to minimize the combined error 


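$$E_P^{fb} \triangleq \sum_{n=N_i}^{N_f} \left[ |e^{f}(n)|^2 + |e^{b}(n)|^2 \right]$$
$$\phantom{E_P^{fb}} = \sum_{n=N_i}^{N_f} \left[ |\hat{\mathbf{a}}^{H}\mathbf{x}(n)|^2 + |\hat{\mathbf{a}}^{T}\mathbf{J}\mathbf{x}(n)|^2 \right] \tag{8.2.15}$$
$$\phantom{E_P^{fb}} = \hat{\mathbf{a}}^{H}\mathbf{X}^{H}\mathbf{X}\hat{\mathbf{a}} + \hat{\mathbf{a}}^{H}\mathbf{J}\mathbf{X}^{T}\mathbf{X}^{*}\mathbf{J}\hat{\mathbf{a}}$$
subject to the constraint that the first component of $\hat{\mathbf{a}}$ is 1. Here $\mathbf{J}$ denotes the exchange matrix, which reverses the order of the elements of a vector. The minimization of $E_P^{fb}$ leads to the set of normal
equations
$$(\mathbf{X}^{H}\mathbf{X} + \mathbf{J}\mathbf{X}^{T}\mathbf{X}^{*}\mathbf{J})\,\hat{\mathbf{a}} = \begin{bmatrix} \hat{E}_P^{fb} \\ \mathbf{0} \end{bmatrix} \tag{8.2.16}$$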


which can be solved efficiently to obtain the model parameters (see Section 7.4.2). This method of using the 
forward-backward predictors is called the modified covariance method. Not only does it have the advantage of 
minimizing the combined global error, but also since it uses more data in (8.2.16), it gives better estimates and lower 
error. A similar minimization approach, but implemented at each local stage, is used in Burg’s method, which is 
discussed in Section 8.2.2. 


Frequency-domain interpretation. In the case of full windowing, by using Parseval’s theorem, the error energy 
can be written as 
$$E = \sum_{n} |e(n)|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{|X(e^{j\omega})|^2}{|\hat{H}(e^{j\omega})|^2}\, d\omega \tag{8.2.17}$$

where $|X(e^{j\omega})|^2$ is the spectrum of the modeled windowed signal segment and $\hat{H}(e^{j\omega})$ is the frequency response
of the estimated all-pole model [whose squared magnitude is the estimated spectrum of x(n)]. This expression is a good approximation for the
other windowing methods if $N \gg P$. Since the integrand in (8.2.17) is positive, minimizing the error E is equivalent to
minimizing the integrated ratio of the energy spectrum of the modeled signal segment to its all-pole-based spectrum.


The presence of this ratio in (8.2.17) has three additional consequences. (1) The quality of the spectral matching
is uniform over the whole frequency range, irrespective of the shape of the spectrum. (2) Since regions where
$|X(e^{j\omega})| > |\hat{H}(e^{j\omega})|$ contribute more to the total error than regions where $|X(e^{j\omega})| < |\hat{H}(e^{j\omega})|$ do, the match is
better near spectral peaks than near spectral valleys. (3) The all-pole model provides a good estimate of the envelope
of the signal spectrum $|X(e^{j\omega})|^2$. These properties are apparent in Figure 8.12, which shows a comparison
between $20\log|X(e^{j\omega})|$ (obtained using the periodogram) and $20\log|\hat{H}(e^{j\omega})|$ [obtained by an AP(28)
model fitted using full windowing] for a 20-ms, Hamming-windowed speech signal sampled at 20 kHz. Note that the
slope of $|\hat{H}(e^{j\omega})|$ is always zero at frequencies $\omega = 0$ and $\omega = \pi$, as expected. More details on these issues can
be found in Makhoul (1975b).
FIGURE 8.12
Illustration of the spectral envelope matching property of all-pole models. 


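The error energy (8.2.17) is also related to the Itakura-Saito (IS) distortion measure, which is given by

$$d_{IS}(R_1, R_2) \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \exp V(e^{j\omega}) - V(e^{j\omega}) - 1 \right] d\omega \tag{8.2.18}$$
where $R_1(e^{j\omega})$ and $R_2(e^{j\omega})$ are two spectra, and
$$V(e^{j\omega}) \triangleq \log R_1(e^{j\omega}) - \log R_2(e^{j\omega}) \tag{8.2.19}$$
Indeed, we can show that
$$d_{IS}(R_1, R_2) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{R_1(e^{j\omega})}{R_2(e^{j\omega})}\, d\omega - \log\frac{\sigma_1^2}{\sigma_2^2} - 1 \tag{8.2.20}$$

where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the innovation sequences corresponding to the factorization of spectra
$R_1(e^{j\omega})$ and $R_2(e^{j\omega})$, respectively. More details can be found in Rabiner and Juang (1993).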
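
The step from (8.2.18) to (8.2.20) can be sketched as follows. The only ingredient not stated in the text is the standard Kolmogorov-Szegő relation between the mean log spectrum of a regular process and its innovation variance, which is assumed here with natural logarithms.

```latex
% Since V = \log R_1 - \log R_2, we have \exp V = R_1 / R_2.
% Assumed (Kolmogorov-Szego): \frac{1}{2\pi}\int_{-\pi}^{\pi} \log R_i(e^{j\omega})\,d\omega = \log \sigma_i^2 .
\begin{aligned}
d_{IS}(R_1,R_2)
 &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{R_1(e^{j\omega})}{R_2(e^{j\omega})}\,d\omega
    - \frac{1}{2\pi}\int_{-\pi}^{\pi} \left[\log R_1(e^{j\omega}) - \log R_2(e^{j\omega})\right] d\omega - 1 \\
 &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{R_1(e^{j\omega})}{R_2(e^{j\omega})}\,d\omega
    - \left(\log\sigma_1^2 - \log\sigma_2^2\right) - 1 .
\end{aligned}
```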


Order selection criteria. The order of an all-pole signal model plays an important role in the modeling problem. 
It determines the number of parameters to be estimated and hence the computational complexity of the algorithm. But 
more importantly, it affects the quality of the spectrum estimates. If a much lower order is selected, then the resulting 
spectrum will be smooth and will display poor resolution. If a much larger order is used, then the spectrum may 
contain spurious peaks at best and a phenomenon called spectrum splitting at worst, in which a single peak is split 
into two separate and distinct peaks (Hayes 1996). 
Several criteria have been proposed over the years for model order selection; however, in practice nothing
surpasses the graphical approach outlined in Examples 8.2.1 and 8.2.2 combined with the experience of the user.
Therefore, we only provide a brief summary of some well-known criteria and refer the interested reader to Kay
(1988), Porat (1994), and Ljung (1987) for more details. The simplest approach would be to monitor the modeling
error and then select the order at which this error enters a steady state. However, for all-pole models, the modeling
error is monotonically decreasing, which makes this approach all but impossible. The general idea behind the
suggested criteria is to introduce a penalty function in the modeling error that increases with the model order P.
We present the following four criteria that are based on the above general idea.


FPE criterion. The final prediction error (FPE) criterion, proposed by Akaike (1970), is based on the function

$$\mathrm{FPE}(P) = \frac{N+P}{N-P}\,\hat{\sigma}_P^2 \tag{8.2.21}$$

where $\hat{\sigma}_P^2$ is the modeling error [or variance of the residual of the estimated AP(P) model]. We note that the term
$\hat{\sigma}_P^2$ decreases or remains the same with increasing P, whereas the term $(N+P)/(N-P)$ accounts for the
increase in $\hat{\sigma}_P^2$ due to inaccuracies in the estimated parameters and increases with P. Clearly, FPE(P) is an inflated
version of $\hat{\sigma}_P^2$. The FPE order selection criterion is to choose the P that minimizes the function in (8.2.21).





AIC. The Akaike information criterion (AIC), also introduced by Akaike (1974), is based on the function
$$\mathrm{AIC}(P) = N \log \hat{\sigma}_P^2 + 2P \tag{8.2.22}$$
It is a very general criterion that provides an estimate of the Kullback-Leibler distance (Kullback 1959) between an
assumed and the true probability density function of the data. The performances of the FPE criterion and the AIC are
quite similar.


MDL criterion. The minimum description length (MDL) criterion was proposed by Rissanen (1978) and uses
the function
$$\mathrm{MDL}(P) = N \log \hat{\sigma}_P^2 + P \log N \tag{8.2.23}$$
The first term in (8.2.23) decreases with P, but the second penalty term increases. It has been shown (Rissanen
1978) that this criterion provides a consistent order estimate in that the probability that the estimated order is equal
to the true order approaches 1 as the data length N tends to infinity.


CAT. This criterion is based on Parzen's criterion autoregressive transfer (CAT) function (Parzen 1977), which
is given by

$$\mathrm{CAT}(P) = \frac{1}{N} \sum_{k=1}^{P} \frac{N-k}{N\hat{\sigma}_k^2} - \frac{N-P}{N\hat{\sigma}_P^2} \tag{8.2.24}$$
This criterion is asymptotically equivalent to the AIC and the MDL criteria.

Basically, all order selection criteria add to the variance of the residuals a term that grows with the order of the
model and estimate the order of the model by minimizing the resulting criterion. However, when $P \ll N$, which is the
case in many practical applications, the criterion does not exhibit a clear minimum, which makes the order selection
process difficult (see Problem 8.1).
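
The first three criteria are simple functions of the residual variance, so they are easy to tabulate once a sequence of models has been fitted. The sketch below assumes that the residual variances for orders P = 1, 2, ... have already been estimated by some fitting routine; the function names `order_criteria` and `best_order` are illustrative.

```python
import math


def order_criteria(residual_vars, N):
    """Tabulate FPE (8.2.21), AIC (8.2.22), and MDL (8.2.23).

    residual_vars[P - 1] is the estimated residual variance of the order-P model;
    N is the number of data samples (every order must satisfy P < N).
    """
    table = []
    for P, var in enumerate(residual_vars, start=1):
        table.append({
            "P": P,
            "FPE": (N + P) / (N - P) * var,
            "AIC": N * math.log(var) + 2 * P,
            "MDL": N * math.log(var) + P * math.log(N),
        })
    return table


def best_order(table, name):
    """Return the order that minimizes the named criterion."""
    return min(table, key=lambda row: row[name])["P"]
```

For example, `best_order(order_criteria(variances, N), "MDL")` returns the MDL choice. When the variance curve flattens slowly, the three criteria can disagree or show a shallow minimum, which is the difficulty noted above.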


8.2.2 Lattice Structures 


We noted in Section 6.5 that a prediction error filter, and hence the AP model, can also be implemented by using a 
lattice structure. The Pth-order forward prediction error $e(n) = e_P^{f}(n)$ and the total squared error

$$E_P = \sum_{n=N_i}^{N_f} |e(n)|^2 \tag{8.2.25}$$

are nonlinear functions of the lattice parameters $k_m$, $0 \le m \le P-1$. For example, if P = 2, we have
$$e_2^{f}(n) = x(n) + (k_0^{*} + k_0 k_1^{*})\,x(n-1) + k_1^{*}\,x(n-2)$$

which shows that $e_2^{f}(n)$ depends on the product $k_0 k_1^{*}$. Thus, fitting an all-pole lattice model by minimizing $E_P$
with respect to $k_m$, $0 \le m \le P-1$, leads to a difficult nonlinear optimization problem.
We can avoid this problem by replacing the above "global" optimization with P "local" optimizations from
m = 1 to P, one for each stage of the lattice. From the lattice equations

$$e_m^{f}(n) = e_{m-1}^{f}(n) + k_{m-1}^{*}\, e_{m-1}^{b}(n-1) \tag{8.2.26}$$

$$e_m^{b}(n) = e_{m-1}^{b}(n-1) + k_{m-1}\, e_{m-1}^{f}(n) \tag{8.2.27}$$

we see that the mth-order prediction errors depend on the coefficient $k_{m-1}$ only. Furthermore, the values of $e_{m-1}^{f}(n)$
and $e_{m-1}^{b}(n)$ have been computed by using $k_{m-2}$, which has been determined from the optimization step at the
previous stage.

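Hence, to minimize the forward prediction error

$$E_m^{f} = \sum_{n=N_i}^{N_f} |e_m^{f}(n)|^2 \tag{8.2.28}$$

we substitute (8.2.26) into (8.2.28) and differentiate with respect to $k_{m-1}^{*}$. This leads to the following optimum value
of $k_{m-1}$

$$k_{m-1}^{FP} = -\frac{\beta_{m-1}^{fb}}{E_{m-1}^{b}} \tag{8.2.29}$$
where
$$\beta_{m-1}^{fb} = \sum_{n=N_i}^{N_f} e_{m-1}^{f*}(n)\, e_{m-1}^{b}(n-1) \tag{8.2.30}$$
and
$$E_{m-1}^{b} = \sum_{n=N_i}^{N_f} |e_{m-1}^{b}(n-1)|^2 \tag{8.2.31}$$

Similarly, minimization of the backward prediction error (8.2.31) gives

$$k_{m-1}^{BP} = -\frac{\beta_{m-1}^{fb}}{E_{m-1}^{f}} \tag{8.2.32}$$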

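Burg (1967) suggested the estimation of $k_{m-1}$ by minimizing

$$E_m^{fb} = \sum_{n=N_i}^{N_f} \left[ |e_m^{f}(n)|^2 + |e_m^{b}(n)|^2 \right] \tag{8.2.33}$$

at each stage of the lattice.¹ Indeed, substituting (8.2.26) and (8.2.27) in the last equation, we obtain the relationship

$$E_m^{fb} = (1 + |k_{m-1}|^2)\,E_{m-1}^{f} + 4\,\mathrm{Re}(k_{m-1}^{*}\beta_{m-1}^{fb}) + (1 + |k_{m-1}|^2)\,E_{m-1}^{b} \tag{8.2.34}$$
If we set $\partial E_m^{fb} / \partial k_{m-1}^{*} = 0$, we obtain the following estimate of $k_{m-1}$:
$$k_{m-1}^{B} = -\frac{\beta_{m-1}^{fb}}{\frac{1}{2}\left(E_{m-1}^{f} + E_{m-1}^{b}\right)} = \frac{2\,k_{m-1}^{FP}\,k_{m-1}^{BP}}{k_{m-1}^{FP} + k_{m-1}^{BP}} \tag{8.2.35}$$

We note that $k_{m-1}^{B}$ is the harmonic mean of $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$. We also stress that the obtained model is different from
the one resulting from the forward-backward least-squares (FBLS) method through global optimization [see (8.2.16)].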
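
The harmonic-mean property can be verified in one line from the forward-only and backward-only estimates. The short check below uses only the definitions $k^{FP}_{m-1} = -\beta^{fb}_{m-1}/E^{b}_{m-1}$ and $k^{BP}_{m-1} = -\beta^{fb}_{m-1}/E^{f}_{m-1}$ and assumes $\beta^{fb}_{m-1} \neq 0$.

```latex
\frac{1}{k^{FP}_{m-1}} + \frac{1}{k^{BP}_{m-1}}
  = -\frac{E^{b}_{m-1} + E^{f}_{m-1}}{\beta^{fb}_{m-1}}
\quad\Longrightarrow\quad
\frac{2}{\dfrac{1}{k^{FP}_{m-1}} + \dfrac{1}{k^{BP}_{m-1}}}
  = -\frac{2\,\beta^{fb}_{m-1}}{E^{f}_{m-1} + E^{b}_{m-1}}
  = k^{B}_{m-1}
```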


Itakura and Saito (1971) proposed an estimate of $k_{m-1}$ based on replacing the theoretical ensemble averages in
(6.5.24) by time averages. Their estimate is given by

$$k_{m-1}^{IS} = -\frac{\beta_{m-1}^{fb}}{\sqrt{E_{m-1}^{f}\,E_{m-1}^{b}}} = \mathrm{sign}(k_{m-1}^{FP} \text{ or } k_{m-1}^{BP})\,\sqrt{k_{m-1}^{FP}\,k_{m-1}^{BP}} \tag{8.2.36}$$

and is also known as the geometric mean method. Since it can be shown that
$$|k_{m-1}^{B}| \le |k_{m-1}^{IS}| \le 1 \tag{8.2.37}$$

both estimates result in minimum-phase models (see Problem 8.2). From (8.2.36) and (8.2.37) we conclude that if
$|k_{m-1}^{FP}| < 1$, then $|k_{m-1}^{BP}| > 1$ and vice versa; that is, if the FLP is minimum-phase, then the BLP is maximum-phase
and vice versa. Several other estimates are discussed in Makhoul (1977) and Viswanathan and Makhoul (1975).

¹This approach should not be confused with the maximum entropy method, also introduced by Burg and discussed later.

In all previous methods, we use no windowing; that is, we set $N_i = m$ and $N_f = N-1$. If we use data
windowing, all the above estimates are identical to the data windowing estimates obtained using the algorithm of
Levinson-Durbin (see Problem 8.3).

The variance of the residuals can be estimated by

$$\hat{\sigma}_m^2 = \frac{1}{2}\,\frac{E_m^{fb}}{N-m} \tag{8.2.38}$$
which for large values of N (see Problem 8.12) can be approximated by
$$\hat{\sigma}_m^2 = \hat{\sigma}_{m-1}^2\,(1 - |k_{m-1}|^2) \tag{8.2.39}$$
where
$$\hat{\sigma}_0^2 = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|^2 \tag{8.2.40}$$
The computations for the lattice estimation methods are summarized in Table 8.1, and the algorithms are
implemented by the function [k, var] = aplatest(x, P).
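TABLE 8.1
Algorithm for estimation of AP lattice parameters.

1. Input: x(n) for $N_i \le n \le N_f$.
2. Initialization
   a. $e_0^{f}(n) = e_0^{b}(n) = x(n)$.
   b. Compute $\beta_0^{fb}$, $E_0^{f}$, and $E_0^{b}$ from x(n).
   c. Compute $k_0^{FP}$ and $k_0^{BP}$.
   d. Compute either $k_0^{IS}$ or $k_0^{B}$ from $k_0^{FP}$ and $k_0^{BP}$.
   e. Apply the first stage of the lattice to x(n), using either $k_0^{IS}$ or $k_0^{B}$, to obtain $e_1^{f}(n)$ and $e_1^{b}(n)$.
3. For m = 2, 3, ..., M
   a. Compute $\beta_{m-1}^{fb}$, $E_{m-1}^{f}$, and $E_{m-1}^{b}$ from $e_{m-1}^{f}(n)$ and $e_{m-1}^{b}(n)$.
   b. Compute $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$.
   c. Compute either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ from $k_{m-1}^{FP}$ and $k_{m-1}^{BP}$.
   d. Apply the mth stage of the lattice to $e_{m-1}^{f}(n)$ and $e_{m-1}^{b}(n)$, using either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$, to obtain $e_m^{f}(n)$ and $e_m^{b}(n)$.
4. Output: Either $k_{m-1}^{IS}$ or $k_{m-1}^{B}$ for m = 1, 2, ..., M, and $e_M^{f}(n)$ and $e_M^{b}(n)$.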
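
The stage-by-stage procedure in Table 8.1 can be sketched in a few lines for the Burg (harmonic mean) choice of reflection coefficient. The version below is an illustrative simplification and not the book's `aplatest` function: it assumes real-valued data (so the conjugates in the lattice equations drop out), uses no windowing with sums over m ≤ n ≤ N − 1 at stage m, and reports the variance through (8.2.38). The name `burg_lattice` is hypothetical.

```python
def burg_lattice(x, M):
    """Estimate lattice parameters k_0 .. k_{M-1} with Burg's local minimization.

    Real-valued data are assumed. Returns (k, var), where var follows (8.2.38)
    for the final stage. A nonzero signal is assumed so that E^f + E^b > 0.
    """
    N = len(x)
    ef = list(x)                      # e_0^f(n) = x(n)
    eb = list(x)                      # e_0^b(n) = x(n)
    k = []
    var = sum(v * v for v in x) / N   # order-0 variance, as in (8.2.40)
    for m in range(1, M + 1):
        ns = range(m, N)              # no windowing: N_i = m, N_f = N - 1
        beta = sum(ef[n] * eb[n - 1] for n in ns)
        Ef = sum(ef[n] ** 2 for n in ns)
        Eb = sum(eb[n - 1] ** 2 for n in ns)
        km = -2.0 * beta / (Ef + Eb)  # harmonic-mean estimate (8.2.35)
        new_ef = [0.0] * N
        new_eb = [0.0] * N
        for n in ns:
            new_ef[n] = ef[n] + km * eb[n - 1]   # (8.2.26), real data
            new_eb[n] = eb[n - 1] + km * ef[n]   # (8.2.27)
        ef, eb = new_ef, new_eb
        k.append(km)
        var = 0.5 * sum(ef[n] ** 2 + eb[n] ** 2 for n in ns) / (N - m)
    return k, var
```

Because |2β| never exceeds E^f + E^b, every coefficient returned by this sketch has magnitude at most 1, in line with (8.2.37).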





8.2.3 Maximum Entropy Method 


We next show how LS all-pole modeling is related to Burg’s method of maximum entropy. To this end, suppose that 
x(n) is a normal, stationary process with zero mean. The M-dimensional complex-valued vector $\mathbf{x} \triangleq \mathbf{x}_M(n)$
obeys a normal distribution

$$p(\mathbf{x}) = \frac{1}{\pi^{M} \det \mathbf{R}} \exp(-\mathbf{x}^{H}\mathbf{R}^{-1}\mathbf{x}) \tag{8.2.41}$$

where R is a Toeplitz correlation matrix. By definition, its entropy is given by

$$\mathcal{H}(\mathbf{x}) \triangleq -E\{\log p(\mathbf{x})\} = M \log \pi + \log(\det \mathbf{R}) + M \tag{8.2.42}$$
because $E\{\mathbf{x}^{H}\mathbf{R}^{-1}\mathbf{x}\} = M$. If the process x(n) is regular, that is, $|k_m| < 1$ for all m, we have
$$\det \mathbf{R} = \prod_{m=0}^{M-1} P_m \quad \text{and} \quad P_m = r(0) \prod_{j=1}^{m} (1 - |k_j|^2) \tag{8.2.43}$$

where $P_m = P_m^{f} = P_m^{b}$ (see Section 6.4). If we substitute (8.2.43) into (8.2.42), we obtain

$$\mathcal{H}(\mathbf{x}) = M \log \pi + M + M \log r(0) + \sum_{m=1}^{M-1} (M-m) \log(1 - |k_m|^2) \tag{8.2.44}$$
which expresses the entropy in terms of r(0) and the PACS $k_m$, $1 \le m < M < \infty$ [recall that any parametric
model can be specified by r(0) and the PACS]. Suppose now that we are given the first P+1 values
r(0), r(1), ..., r(P) of the autocorrelation sequence and we wish to find a model, by choosing the remaining
values r(l), l > P, so that the entropy is maximized. From (8.2.44), we see that the entropy is maximized if we
choose $k_m = 0$ for m > P, that is, by modeling the process x(n) by an AR(P) model. In conclusion, among all
regular Gaussian processes with the same first P+1 autocorrelation values, the AR(P) process has the maximum
entropy. Any other choices for $k_m$, m > P, that satisfy the condition $|k_m| < 1$ lead to a valid extension of the
autocorrelation sequence. The "extended" values r(l), l > P, can be obtained by using the inverse
Levinson-Durbin or the inverse Schur algorithm (see Chapter 6). The relation between autoregressive modeling and
the principle of maximum entropy, known as the maximum entropy method, was introduced by Burg (1967, 1975). 
We note that the above proof, given in Porat (1994), is different from the original proof provided by Burg (Burg 1975; 
Therrien 1992). An interesting discussion of various arguments in favor of and against the maximum entropy method 
can be found in Makhoul (1986). 
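
The argument is easy to check numerically: each term $(M-m)\log(1-|k_m|^2)$ in (8.2.44) is nonpositive, so any nonzero coefficient beyond lag P can only lower the entropy. The sketch below evaluates (8.2.44) for a given r(0) and list of PACS values, treating coefficients that are not supplied as zero. The function name `gaussian_entropy` is illustrative.

```python
import math


def gaussian_entropy(r0, pacs, M):
    """Evaluate (8.2.44). pacs[m - 1] holds k_m; coefficients not supplied are zero."""
    H = M * math.log(math.pi) + M + M * math.log(r0)
    for m in range(1, M):
        km = pacs[m - 1] if m - 1 < len(pacs) else 0.0
        H += (M - m) * math.log(1.0 - abs(km) ** 2)
    return H
```

Comparing `gaussian_entropy(r0, [k1, k2], M)` with the same call after appending a nonzero third coefficient shows the drop in entropy caused by departing from the AR(2) extension.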


8.2.4 Excitations with Line Spectra 


When the excitation of a parametric model has a spectrum with lines at L frequencies $\omega_m$, the spectrum of the
output signal provides information about the frequency response of the model at these frequencies only. For
simplicity, assume equidistant samples at frequencies $\omega_m = 2\pi m/L$, $0 \le m \le L-1$. Given a set of values
$R(e^{j\omega_m}) = |X(e^{j\omega_m})|^2$, we wish to find an AP(P) model whose spectrum $\hat{R}(e^{j\omega})$ matches $R(e^{j\omega_m})$ at the given
frequencies, by minimizing the criterion
$$E = \frac{d_0^2}{L} \sum_{m=0}^{L-1} \frac{R(e^{j\omega_m})}{\hat{R}(e^{j\omega_m})} \tag{8.2.45}$$

which is the discrete version of (8.2.17), where $d_0$ is the gain of the model (see Section 3.2). The minimization of
(8.2.45) with respect to the model parameters $\{a_k\}$ results in the Yule-Walker equations

$$\sum_{k=0}^{P} a_k\, \tilde{r}(i-k) = \begin{cases} E & i = 0 \\ 0 & 1 \le i \le P \end{cases} \tag{8.2.46}$$
where
$$\tilde{r}(l) = \frac{1}{L} \sum_{m=0}^{L-1} R(e^{j\omega_m})\, e^{j\omega_m l} \tag{8.2.47}$$
For continuous spectra, linear prediction uses the autocorrelation
$$r(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(e^{j\omega})\, e^{j\omega l}\, d\omega \tag{8.2.48}$$

which is related to $\tilde{r}(l)$ by

$$\tilde{r}(l) = \sum_{m=-\infty}^{\infty} r(l - Lm) \tag{8.2.49}$$

that is, $\tilde{r}(l)$ is an aliased version of r(l). We have seen that linear prediction equates the autocorrelation of the
AP(P) model to the autocorrelation of the modeled signal for the first P+1 lags. Hence, when we use linear prediction
for a signal with line spectra, the autocorrelation of the all-pole model will be matched to $\tilde{r}(l) \ne r(l)$ and will
always result in a model different from the original. Clearly, the correlation matching condition cannot compensate
for the autocorrelation aliasing, which becomes more pronounced as L decreases. This phenomenon, which is severe
for voiced sounds with high pitch, is illustrated in Problem 8.13. A method that provides better estimates, by
minimizing a discrete version of the Itakura-Saito error measure, has been developed for both AP and PZ models by
El-Jaroudi and Makhoul (1991, 1989).
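
The aliasing relation can be illustrated numerically with a spectrum whose autocorrelation is known in closed form. The sketch below assumes a real first-order example with $r(l) = \rho^{|l|}$ and $R(e^{j\omega}) = (1-\rho^2)/(1 - 2\rho\cos\omega + \rho^2)$; it compares the autocorrelation obtained from L equidistant spectrum samples with a truncated version of the aliased sum. The function names and the truncation length are illustrative.

```python
import cmath
import math


def sampled_autocorr(R, L, l):
    """Inverse DFT of L equidistant spectrum samples R(2*pi*m/L), m = 0..L-1."""
    total = sum(R(2 * math.pi * m / L) * cmath.exp(2j * math.pi * m * l / L)
                for m in range(L))
    return (total / L).real


def aliased_autocorr(r, L, l, terms=200):
    """Truncated aliased sum of r(l - L*m) over m = -terms..terms."""
    return sum(r(l - L * m) for m in range(-terms, terms + 1))


rho = 0.8


def R(w):
    # Spectrum of the assumed first-order process with r(l) = rho**|l|.
    return (1 - rho ** 2) / (1 - 2 * rho * math.cos(w) + rho ** 2)


def r(l):
    return rho ** abs(l)
```

With only a few spectral lines (small L), `sampled_autocorr(R, L, 1)` differs markedly from `r(1)`, which is the mismatch that linear prediction then inherits.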


8.3 Estimation of Pole-zero Models 


The estimation of PZ (P,Q) model parameters for Q #0 leads to a nonlinear LS optimization problem. As a result, 
a vast number of suboptimum methods, with reduced computational complexity, have been developed to avoid this 
problem. For example, some techniques estimate the AP(P) and AZ(Q) parameters separately. However, today the 
availability of high-speed computers has made exact least-squares the method of choice. Since the nonlinear LS 
optimization with respect to a complex vector and its conjugate is inherently difficult, and since this optimization does
not provide any additional insight into the solution technique, we assume, in this section, that the quantities are
real-valued. Furthermore, most real-world applications of pole-zero models involve real-valued
signals and systems. The extension to the complex-valued case is straightforward.
Consider the PZ(P,Q) model

$$x(n) = -\sum_{k=1}^{P} a_k\, x(n-k) + w(n) + \sum_{k=1}^{Q} d_k\, w(n-k) \tag{8.3.1}$$
where $w(n) \sim \mathrm{WN}(0, \sigma_w^2)$. Using vector notation, we can express (8.3.1) as
$$x(n) = \mathbf{z}^{T}(n-1)\,\mathbf{c}_{pz} + w(n) \tag{8.3.2}$$
where
$$\mathbf{z}(n) \triangleq [-x(n) \;\; \cdots \;\; -x(n-P+1) \;\; w(n) \;\; \cdots \;\; w(n-Q+1)]^{T} \tag{8.3.3}$$
and
$$\mathbf{c}_{pz} = [\mathbf{a}^{T} \;\; \mathbf{d}^{T}]^{T} = [a_1 \;\; \cdots \;\; a_P \;\; d_1 \;\; \cdots \;\; d_Q]^{T} \tag{8.3.4}$$

8.3.1 Known Excitation 


Assume for a moment that the excitation w(n) is known. Then we can predict x(n) from past values, using the
following linear predictor

$$\hat{x}(n) = \mathbf{z}^{T}(n-1)\,\mathbf{c} \tag{8.3.5}$$

where
$$\mathbf{c} = [\hat{a}_1 \;\; \cdots \;\; \hat{a}_P \;\; \hat{d}_1 \;\; \cdots \;\; \hat{d}_Q]^{T} \tag{8.3.6}$$
are the predictor parameters. The prediction error
$$e(n) = x(n) - \hat{x}(n) = x(n) - \mathbf{z}^{T}(n-1)\,\mathbf{c} \tag{8.3.7}$$
equals w(n) if $\mathbf{c} = \mathbf{c}_{pz}$. Minimization of the total squared error
$$E(\mathbf{c}) = \sum_{n=N_i}^{N_f} e^2(n) \tag{8.3.8}$$

leads to the following linear system of equations
$$\hat{\mathbf{R}}_z\,\mathbf{c} = \hat{\mathbf{r}}_z \tag{8.3.9}$$
where
$$\hat{\mathbf{R}}_z = \sum_{n=N_i}^{N_f} \mathbf{z}(n-1)\,\mathbf{z}^{T}(n-1) \tag{8.3.10}$$
and
$$\hat{\mathbf{r}}_z = \sum_{n=N_i}^{N_f} \mathbf{z}(n-1)\,x(n) \tag{8.3.11}$$

Usually, we use residual windowing, which implies that $N_i = \max(P,Q)$ and $N_f = N-1$. Since the matrix $\hat{\mathbf{R}}_z$ is
symmetric and positive semidefinite, we can solve (8.3.9) using $\mathbf{LDL}^{T}$ decomposition. Thus, if we know the
excitation w(n), the least-squares estimation of the PZ(P,Q) model parameters reduces to the solution of a linear
system of equations. An estimate of the input variance is given by

$$\hat{\sigma}_w^2 = \frac{1}{N - \max(P,Q)} \sum_{n=\max(P,Q)}^{N-1} e^2(n) \tag{8.3.12}$$

This method, which is implemented by the function pzls.m, is known as the equation-error method and can be used
to identify a system from input-output data (Ljung 1987) (see Problem 8.14).


8.3.2 Unknown Excitation 


In most applications, the excitation w(n) is not known. However, we can obtain a good estimate of x(n) by
replacing w(n) by e(n) in (8.3.5). This is a natural choice if the model used to obtain e(n) is reasonably
accurate. The prediction error is then given by

$$e(n) = x(n) - \hat{x}(n) = x(n) - \hat{\mathbf{z}}^{T}(n-1)\,\mathbf{c} \tag{8.3.13}$$
where
$$\hat{\mathbf{z}}(n) \triangleq [-x(n) \;\; \cdots \;\; -x(n-P+1) \;\; e(n) \;\; \cdots \;\; e(n-Q+1)]^{T} \tag{8.3.14}$$
If we write (8.3.13) explicitly
$$e(n) = -\sum_{k=1}^{Q} \hat{d}_k\, e(n-k) + x(n) + \sum_{k=1}^{P} \hat{a}_k\, x(n-k) \tag{8.3.15}$$

we see that the prediction error is obtained by exciting the inverse model with the signal x(n). Hence, the inverse
model has to be stable. To satisfy this condition, we require the estimated model to be minimum-phase.

The recursive computation of e(n) by (8.3.15) makes the prediction error a nonlinear function of the model
parameters. To illustrate this, consider the prediction error for a first-order model, that is, for P = Q = 1

$$e(n) = x(n) + \hat{a}_1 x(n-1) - \hat{d}_1 e(n-1)$$
Assuming e(0) = 0, we have for n = 1, 2, 3
$$\begin{aligned}
e(1) &= x(1) + \hat{a}_1 x(0) \\
e(2) &= x(2) + \hat{a}_1 x(1) - \hat{d}_1 e(1) \\
     &= x(2) + (\hat{a}_1 - \hat{d}_1)\,x(1) - \hat{a}_1 \hat{d}_1\, x(0) \\
e(3) &= x(3) + \hat{a}_1 x(2) - \hat{d}_1 e(2) \\
     &= x(3) + (\hat{a}_1 - \hat{d}_1)\,x(2) - (\hat{a}_1 - \hat{d}_1)\,\hat{d}_1\, x(1) + \hat{a}_1 \hat{d}_1^2\, x(0)
\end{aligned}$$
which shows that e(n) is a nonlinear function of the model parameters if $Q \ne 0$. Thus, the total squared error

$$E(\mathbf{c}) = \sum_{n=N_i}^{N_f} e^2(n) \tag{8.3.16}$$
expressed in terms of the signal values x(0), x(1), ..., x(N−1), is a nonquadratic function of the model
parameters. Sometimes, $E(\mathbf{c})$ has several local minima. The model parameters can be obtained by minimizing the
total squared error using nonlinear optimization techniques.
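
The recursion behind the prediction error is short enough to write out, which also makes the nonlinearity in the parameters easy to probe numerically. The sketch below assumes real-valued data and zero initial conditions, e(n) = 0 for n < max(P, Q), which reduces to e(0) = 0 in the first-order case. The function names are illustrative, and the routine only evaluates the error for given parameters; the minimization itself is left to a nonlinear optimizer.

```python
def pz_residual(x, a, d):
    """Prediction error of a PZ(P,Q) model for given parameters.

    a = [a_1, ..., a_P], d = [d_1, ..., d_Q]; e(n) = 0 is assumed for n < max(P, Q).
    """
    P, Q = len(a), len(d)
    start = max(P, Q)
    e = [0.0] * len(x)
    for n in range(start, len(x)):
        ar_part = sum(a[k - 1] * x[n - k] for k in range(1, P + 1))
        ma_part = sum(d[k - 1] * e[n - k] for k in range(1, Q + 1))
        e[n] = x[n] + ar_part - ma_part
    return e


def total_squared_error(x, a, d):
    """Sum of squared prediction errors, the quantity to be minimized over (a, d)."""
    return sum(v * v for v in pz_residual(x, a, d))
```

For P = Q = 1 the third output sample agrees with the closed-form expansion of e(3) in terms of x(0), ..., x(3), which contains products such as $\hat{a}_1 \hat{d}_1^2$ and is therefore not quadratic in the parameters.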
8.4 Applications


Pole-zero modeling has many applications in such fields as spectral estimation, speech processing, geophysics, 
biomedical signal processing, and general time series analysis and forecasting (Marple 1987; Kay 1988; Robinson 
and Treitel 1980; Box, Jenkins, and Reinsel 1994). In this section, we discuss the application of pole-zero models to 
spectral estimation and speech processing. 


8.4.1 Spectral Estimation 


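After we have estimated the parameters of a PZ model, we can compute the PSD of the analyzed process by

$$\hat{R}(e^{j\omega}) = \hat{\sigma}_w^2\, \frac{\left| 1 + \sum_{k=1}^{Q} \hat{d}_k\, e^{-j\omega k} \right|^2}{\left| 1 + \sum_{k=1}^{P} \hat{a}_k\, e^{-j\omega k} \right|^2} \tag{8.4.1}$$
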

In practice, we mainly use AP models because (1) the all-zero PSD estimator is essentially identical to the 
Blackman-Tukey one (see Problem 8.16) and (2) the application of pole-zero PSD estimators is limited by 
computational and other practical difficulties. Also, any continuous PSD can be approximated arbitrarily well by the 
PSD of an AP(P) model if P is chosen large enough (Anderson 1971). However, in practice, the value of P is limited 
by the amount of available data (usually P < N/3). The statistical properties of all-pole PSD estimators are difficult 
to obtain; however, it has been shown that the estimator is consistent only if the analyzed process is AR($P_0$) with
$P_0 \le P$. Furthermore, the quality of the estimator degrades if the process is contaminated by noise. More details
about pole-zero PSD estimation can be found in Kay (1988), Porat (1994), and Percival and Walden (1993). 

The performance of all-pole PSD estimators depends on the method used to estimate the model parameters, the 
order of the model, and the presence of noise. The effect of model mismatch is shown in Figure 8.13 and is further 
investigated in Problem 8.17. Order selection in all-pole PSD estimation is absolutely critical: If P is too large, the
obtained PSD exhibits spurious peaks; if P is too small, the structure of the PSD is smoothed over. The increased
resolution of the parametric techniques, compared to the nonparametric PSD estimation methods, is basically the
result of imposing structure on the data (i.e., a model). The model makes possible the extrapolation of the ACS,
which in turn leads to better resolution. However, if the adopted model is inaccurate, that is, if it does not match the
data, then the “gained” resolution reflects the model and not the data! As a result, despite their popularity and their 
“success” with simulated signals, the application of parametric PSD estimation techniques to actual experimental data 
is rather limited. 

Figure 8.13 shows the results of a Monte Carlo simulation of various all-pole PSD estimation techniques. We see 
that, except for the windowing approach that results in a significant loss of resolution, all other techniques have similar 
performance. However, we should mention that the forward/backward LS all-pole modeling method is considered to 
provide the best results (Marple 1987). 

In practice, it is our experience that the best way to estimate the PSD of an actual signal is to combine 
parametric prewhitening with nonparametric PSD estimation methods. The process is illustrated in Figure 8.14 and 
involves the following steps: 


1. Fit an AP(P) model to the data using the forward LS, forward/backward LS, or Burg's method with no
windowing.
2. Compute the residual (prediction error)

$$e(n) = x(n) + \sum_{k=1}^{P} \hat{a}_k^{*}\, x(n-k) \qquad P \le n \le N-1 \tag{8.4.2}$$

and then compute and plot its ACS, PACS, and cumulative periodogram (see Figure 8.2) to see if it is reasonably
white. The goal is not to completely whiten the residual but to reduce its spectral dynamic range, that is, to increase
its spectral flatness to avoid spectral leakage.
3. Compute the PSD $\hat{R}_e(e^{j\omega})$ of the residual, using one of the nonparametric techniques discussed in Chapter 4.
4. Compute the PSD of x(n) by

$$\hat{R}_x(e^{j\omega}) = \frac{\hat{R}_e(e^{j\omega})}{|\hat{A}(e^{j\omega})|^2}$$
that is, by applying postcoloring to "undo" the prewhitening.

The main goal of AP modeling here is to reduce the spectral dynamic range to avoid leakage. In other words, we
need a good linear predictor regardless of whether the process is true AR(P). Therefore, very accurate order
selection and model fit are not critical, because all spectral structure not captured by the model is still in the residuals.
Needless to say, if the periodogram of x(n) has a small dynamic range, we do not need prewhitening. Another
interesting application of prewhitening is for the detection of outliers in practical data (Martin and Thomson 1982).
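
The four steps above can be sketched compactly for real-valued data. In the sketch below the predictor coefficients are assumed to be given (for example, from an LS or Burg fit), and a plain periodogram stands in for the nonparametric estimator of step 3; in practice an averaged estimator such as Welch's method would normally be used. All function names are illustrative.

```python
import cmath
import math


def prediction_residual(x, a):
    """Step 2: e(n) = sum_k a[k] x(n-k) for P <= n <= N-1, with a[0] = 1 (real data)."""
    P = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(P + 1)) for n in range(P, len(x))]


def periodogram(e, freqs):
    """Step 3 stand-in: raw periodogram |sum_n e(n) exp(-j 2 pi f n)|^2 / L."""
    L = len(e)
    return [abs(sum(v * cmath.exp(-2j * math.pi * f * n) for n, v in enumerate(e))) ** 2 / L
            for f in freqs]


def prewhitened_psd(x, a, freqs):
    """Steps 2 to 4: prewhiten with A(z), estimate the residual PSD, then postcolor."""
    e = prediction_residual(x, a)
    Re = periodogram(e, freqs)
    psd = []
    for f, val in zip(freqs, Re):
        A = sum(ak * cmath.exp(-2j * math.pi * f * k) for k, ak in enumerate(a))
        psd.append(val / abs(A) ** 2)   # postcoloring: divide by |A(e^{jw})|^2
    return psd
```

With `a = [1.0]` no prewhitening takes place and the result reduces to the raw periodogram of x(n), which gives a convenient baseline for judging how much the prewhitening step reduces leakage.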
FIGURE 8.13
Monte Carlo simulation for the comparison of all-pole PSD estimation techniques, using 50 realizations of a 50-sample segment from
an AR(4) process using fourth-order AP models.
FIGURE 8.14
Block diagram of nonparametric PSD estimation using linear prediction prewhitening.


EXAMPLE 8.4.1. To illustrate the effectiveness of the above prewhitening and postcoloring method, consider the AR(4) process
x(n) used in Example 8.2.3. This process has a large dynamic range, and hence nonparametric methods such as Welch’s
periodogram averaging method will suffer from leakage problems. Using the system function of the model

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - 2.7607z^{-1} + 3.8106z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

and a WGN(0, 1) input sequence, we generated 256 samples of x(n). These samples were then used to obtain the all-pole LS
predictor coefficients using the arwin function. The spectrum $|\hat{A}(e^{j\omega})|^{-2}$ corresponding to this estimated model is shown in
Figure 8.15 as a dashed curve. The signal samples were prewhitened using the model to obtain the residuals e(n). The
nonparametric PSD estimate $\hat{R}_e(e^{j\omega})$ of e(n) was computed by using Welch’s method with L = 64 and 50 percent overlap.
Finally, $\hat{R}_e(e^{j\omega})$ was postcolored using the spectrum $|\hat{A}(e^{j\omega})|^{-2}$ to obtain $\hat{R}_x(e^{j\omega})$, which is shown in Figure 8.15 as a solid
line. For comparison purposes, the Welch PSD estimate of x(n) is also shown as a dotted line. As expected, the nonparametric
estimate does not resolve the two peaks in the true spectrum and suffers from leakage at high frequencies. However, the combined
nonparametric and parametric estimate resolves the two peaks with ease and also follows the true spectrum quite well. Therefore, the
use of the parametric method as a preprocessor is highly recommended, especially in large-dynamic-range situations.
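The prewhitening and postcoloring procedure can be sketched in a few lines of MATLAB. The following is only a minimal illustration, not the code behind Figure 8.15: it fits the predictor with the autocorrelation (Yule-Walker) method rather than with the arwin function, builds the Hamming window and the Welch averaging by hand so that no toolbox is needed, and all variable names are assumptions made for this sketch.

```matlab
% Prewhitening/postcoloring sketch for the AR(4) process of Example 8.4.1.
% Assumptions: autocorrelation-method predictor (not arwin), hand-built Welch.
a    = [1 -2.7607 3.8106 -2.6535 0.9238];   % A(z) of the generating model
N    = 256;  P = 4;  L = 64;  Nfft = 1024;
x    = filter(1, a, randn(N + 500, 1));     % WGN(0,1) through 1/A(z)
x    = x(end-N+1:end);                      % drop the start-up transient

% Fit an order-P linear predictor from the biased autocorrelation estimate
r = zeros(P+1, 1);
for l = 0:P
    r(l+1) = (x(1:N-l)' * x(1+l:N)) / N;
end
c    = toeplitz(r(1:P)) \ r(2:P+1);         % predictor coefficients
ahat = [1; -c];                             % prediction error filter

% Prewhiten the data to obtain the residual e(n)
e = filter(ahat, 1, x);

% Welch estimate of the residual PSD (Hamming window, 50 percent overlap)
w  = 0.54 - 0.46*cos(2*pi*(0:L-1)'/(L-1));
D  = L/2;                                   % shift between segments
K  = floor((N - L)/D) + 1;                  % number of segments
Re = zeros(Nfft, 1);
for k = 0:K-1
    seg = e(k*D + (1:L)') .* w;
    Re  = Re + abs(fft(seg, Nfft)).^2;
end
Re = Re / (K * sum(w.^2));

% Postcolor with the spectrum of the estimated model
Rx = Re ./ abs(fft(ahat, Nfft)).^2;
```

Plotting 10*log10(Rx(1:Nfft/2)) against frequency gives a curve of the kind described in the example; the exact shape varies from realization to realization.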


FIGURE 8.15
Spectral estimation of AR(4) process using the prewhitening and postcoloring method in Example 8.4.1 (plot title: PSD
estimation of AR(4) signal; vertical axis: power in dB).


8.4.2 Speech Modeling 


All-pole modeling using LS linear prediction is widely employed in speech processing applications because (1) it
provides a good approximation to the vocal tract for voiced sounds and an adequate approximation for unvoiced and
transient sounds, (2) it results in a good separation between source (fine spectral structure) and vocal tract (spectral
envelope), and (3) it is analytically tractable and leads to efficient software and hardware implementations.

FIGURE 8.16
Block diagram of an AP modeling processor for speech coding and recognition.

Figure 8.16 shows a typical AP modeling system, also known as the linear predictive coding (LPC) processor,
that is used in speech synthesis, coding, and recognition applications. The processor operates in a block processing
mode; that is, it processes a frame of N samples and computes a vector of model parameters using the following basic
steps:

1. Preemphasis. The digitized speech signal is filtered by the high-pass filter

$$H_1(z) = 1 - \alpha z^{-1} \qquad 0.9 \le \alpha \le 1 \tag{8.4.4}$$

to reduce the dynamic range of the spectrum, that is, to flatten the spectral envelope, and make subsequent
processing less sensitive to numerical problems (Makhoul 1975a). Usually $\alpha = 0.95$, which results in about a 32
dB boost in the spectrum at $\omega = \pi$ over that at $\omega = 0$. The preemphasizer can be made adaptive by setting
$\alpha = \rho(1)$, where $\rho(l)$ is the normalized autocorrelation of the frame, which corresponds to a first-order
optimum prediction error filter.

2. Frame blocking. Here the preemphasized signal is blocked into frames of N samples with successive frames
overlapping by $N_0 = N/3$ samples. In speech recognition N = 300 with a sampling rate $F_s = 6.67$ kHz, which
corresponds to 45-ms frames overlapping by 15 ms.

3. Windowing. Each frame is multiplied by an N-sample window (usually Hamming) to smooth the discontinuities at
the beginning and the end of the frame.

4. Autocorrelation computation. Here the LPC processor computes the first P+1 values of the autocorrelation 
sequence. Usually, P=8 in speech recognition and P=12 in speech coding applications. The value of r(0) 
provides the energy of the frame, which is useful for speech detection. 

5. LPC analysis. In this step the processor uses the P+1 autocorrelations to compute an LPC parameter set for
each speech frame. Depending on the required parameters, we can use the algorithm of Levinson-Durbin or
the algorithm of Schur. The most widely used parameters are

$$\begin{array}{ll}
a_m = a_m^{(P)} & \text{LPC coefficients} \\
k_m & \text{PACS} \\
g_m = \dfrac{1}{2}\log\dfrac{1-k_m}{1+k_m} = -\tanh^{-1} k_m & \text{log area ratio coefficients} \\
c(m) & \text{cepstral coefficients} \\
\omega_m & \text{line spectrum pairs}
\end{array}$$

where $1 \le m \le P$, except for the cepstrum, which is computed up to about 3P/2. The line spectrum pair
parameters and their application to speech processing are considered in Furui (1989).

The log area ratio and the line spectrum pair coefficients have good quantization properties and are used 
for speech coding (Rabiner and Schafer 1978; Furui 1989); the cepstral coefficients provide an excellent 
discriminant for speech and speaker recognition applications (Rabiner and Juang 1993; Mammone et al. 1996). 
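The five processing steps above can be strung together for a single frame as in the following sketch. It is an illustration under stated assumptions, not the processor of Figure 8.16: a synthetic autoregressive signal stands in for digitized speech, the order P = 8 is taken from the recognition setting mentioned in step 4, the prediction error filter is written as $A(z) = 1 + \sum_m a_m z^{-m}$, and the sign convention chosen for the reflection coefficients k(m) is one of several in use.

```matlab
% LPC front end for a single frame: steps 1 through 5.
% Assumptions: synthetic test signal, P = 8, A(z) = 1 + sum_m a_m z^-m.
alpha = 0.95;  N = 300;  P = 8;
s  = filter(1, [1 -1.3 0.8], randn(2*N, 1));   % stand-in for digitized speech

sp    = filter([1 -alpha], 1, s);              % 1. preemphasis H1(z) = 1 - alpha z^-1
frame = sp(N/3 + (1:N)');                      % 2. one N-sample frame
w     = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));  % 3. Hamming window
xw    = frame .* w;

r = zeros(P+1, 1);                             % 4. autocorrelation, lags 0..P
for l = 0:P
    r(l+1) = xw(1:N-l)' * xw(1+l:N);
end

a = 1;  E = r(1);  k = zeros(P, 1);            % 5. Levinson-Durbin recursion
for m = 1:P
    k(m) = -(r(m+1:-1:2)' * a) / E;            % reflection coefficient
    a    = [a; 0] + k(m) * [0; flipud(a)];     % order update (real-valued data)
    E    = E * (1 - k(m)^2);                   % prediction error energy
end
g = 0.5 * log((1 - k) ./ (1 + k));             % log area ratio coefficients
```

Here r(1) is the frame energy mentioned in step 4, a holds the order-P prediction error filter, and k and g are the reflection and log area ratio parameter sets for the frame.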
AP models are extensively used for the modeling of speech sounds. However, the AP model does not provide an 
accurate description of the speech spectral envelope when the speech production process resembles a PZ system 
(Atal and Schroeder 1978). This can happen when (1) the nasal tract is coupled to the main vocal tract through 
the velar opening, for example, during the generation of nasals and nasalized sounds, (2) the source of excitation 
is not at the glottis but is in the interior of the vocal tract (Flanagan 1972), and (3) the transmission or recording 
channel has zeros in its response. Although a zero can be approximated with arbitrary precision by a number of 
poles, this approximation is usually inefficient and leads to spectral distortion and other problems. These 
problems can be avoided by using pole-zero modeling, as illustrated in the following example. More details 
about pole-zero speech modeling can be found in Atal and Schroeder (1978). 

Figure 8.17(a) shows a Hamming-windowed segment from an artificial nasal speech signal sampled at
$F_s = 10$ kHz. According to acoustic theory, such sounds require both poles and zeros in the vocal tract system
function. Before the fitting of the model, the data are passed through a preemphasis filter with $\alpha = 0.95$. Figure
8.17(b) shows the periodogram of the speech segment, the spectrum of an AP(16) model using data windowing,
and the spectrum of a PZ(12, 6) model using the least-squares algorithm. We see that the pole-zero model
matches zeros (“valleys”) in the periodogram of the data better than other models do.

FIGURE 8.17
(a) Speech segment and (b) periodogram, spectrum of a data windowing-based AP(16) model, and spectrum of a residual
windowing-based PZ(12, 6) model.


8.5 Harmonic Models and Frequency Estimation Techniques 


The pole-zero models we have discussed so far assume a linear time-invariant system that is excited by white noise.
However, in many applications, the signals of interest are complex exponentials contained in white noise for which a
sinusoidal or harmonic model is more appropriate. Signals consisting of complex exponentials are found as formant
frequencies in speech processing, moving targets in radar, and spatially propagating signals in array processing.† For
real signals, complex exponentials make up a complex conjugate pair (sinusoids), whereas for complex signals, they
may occur at a single frequency.

For complex exponentials found in noise, the parameters of interest are the frequencies of the signals. Therefore,
our goal is to estimate these frequencies from the data. One might consider estimating the power spectrum by using
the nonparametric methods discussed in Chapter 4. The frequency estimates of the complex exponentials are then the
frequencies at which peaks occur in the spectrum. Certainly, the use of these nonparametric methods seems
appropriate for complex exponential signals since they make no assumptions about the underlying process. We might
also consider making use of an all-pole model for the purposes of spectrum estimation as discussed in Section 8.4.1,
also known as the maximum entropy method (MEM) spectral estimation technique. Even though some of these
methods can achieve very fine resolution, none of these methods accounts for the underlying model of complex
exponentials in noise. As in all modeling problems, the use of the appropriate model is desirable from an intuitive
point of view and advantageous in terms of performance. We begin by describing the harmonic signal model,
deriving the model in a vector notation, and looking at the eigendecomposition of the correlation matrix of complex
exponentials in noise. Then we describe frequency estimation methods based on the harmonic model: the Pisarenko
harmonic decomposition, and the MUSIC, minimum-norm, and ESPRIT algorithms.

These methods have the ability to resolve complex exponentials closely spaced in frequency, which has led to the
name superresolution commonly being associated with them. However, a word of caution is in order on the use of these
harmonic models. The high level of performance in terms of resolution is achieved by assuming an underlying model
of the data. As with all other parametric methods, the performance of these techniques depends upon how closely this
mathematical model matches the actual physical process that produced the signals. Deviations from this assumption
result in model mismatch and will produce frequency estimates for a signal that may not have been produced by
complex exponentials. In this case, the frequency estimates have little meaning.

† In array processing, a spatially propagating wave produces a complex exponential signal as measured across uniformly spaced sensors in
an array. The frequency of the complex exponential is determined by the angle of arrival of the impinging, spatially propagating signal.
Thus, in array processing the frequency estimation problem is known as angle-of-arrival (AOA) or direction-of-arrival (DOA) estimation.


8.5.1 Harmonic Model 


Consider the signal model that consists of P complex exponentials in noise

$$x(n) = \sum_{p=1}^{P} \alpha_p e^{j2\pi n f_p} + w(n) \tag{8.5.1}$$

The normalized, discrete-time frequency of the pth component is

$$f_p = \frac{\omega_p}{2\pi} = \frac{F_p}{F_s} \tag{8.5.2}$$

where $\omega_p$ is the discrete-time frequency in radians, $F_p$ is the actual frequency of the pth complex exponential, and
$F_s$ is the sampling frequency. The complex exponentials may occur either individually or in complex conjugate pairs,
as in the case of real signals. In general, we want to estimate the frequencies and possibly also the amplitudes of these
signals. Note that the phase of each complex exponential is contained in the amplitude, that is,

$$\alpha_p = |\alpha_p| e^{j\psi_p} \tag{8.5.3}$$

where the phases $\psi_p$ are uncorrelated random variables uniformly distributed over $[0, 2\pi]$. The
magnitude $|\alpha_p|$ and the frequency $f_p$ are deterministic quantities. If we consider the spectrum of a harmonic
process, we note that it consists of a set of impulses with a constant background level at the power of the white noise
$\sigma_w^2 = E\{|w(n)|^2\}$. As a result, the power spectrum of complex exponentials is commonly referred to as a line
spectrum, as illustrated in Figure 8.18.


FIGURE 8.18
The spectrum of complex exponentials in noise.


Since we will make use of matrix methods based on a certain time window of length M, it is useful to
characterize the signal model in the form of a vector over this time window consisting of the sample delays of the
signal. Consider the signal x(n) from (8.5.1) at its current and future M − 1 values. This time window can be
written as

$$\mathbf{x}(n) = [x(n) \;\; x(n+1) \;\; \cdots \;\; x(n+M-1)]^T \tag{8.5.4}$$

We can then write the signal model consisting of complex exponentials in noise from (8.5.1) for a length-M
time-window vector as

$$\mathbf{x}(n) = \sum_{p=1}^{P} \alpha_p \mathbf{v}(f_p) e^{j2\pi n f_p} + \mathbf{w}(n) = \mathbf{s}(n) + \mathbf{w}(n) \tag{8.5.5}$$

where $\mathbf{w}(n) = [w(n) \;\; w(n+1) \;\; \cdots \;\; w(n+M-1)]^T$ is the time-window vector of white noise and

$$\mathbf{v}(f) = [1 \;\; e^{j2\pi f} \;\; \cdots \;\; e^{j2\pi (M-1) f}]^T \tag{8.5.6}$$

is the time-window frequency vector. Note that $\mathbf{v}(f)$ is simply a length-M DFT vector at frequency f. We
differentiate here between the signal $\mathbf{s}(n)$, consisting of the sum of complex exponentials, and the noise component
$\mathbf{w}(n)$.

Consider the time-window vector model consisting of a sum of complex exponentials in noise from (8.5.5). The
autocorrelation matrix of this model can be written as the sum of signal and noise autocorrelation matrices

$$\mathbf{R}_x = E\{\mathbf{x}(n)\mathbf{x}^H(n)\} = \mathbf{R}_s + \mathbf{R}_w
= \sum_{p=1}^{P} |\alpha_p|^2 \mathbf{v}(f_p)\mathbf{v}^H(f_p) + \sigma_w^2 \mathbf{I}
= \mathbf{V}\mathbf{A}\mathbf{V}^H + \sigma_w^2 \mathbf{I} \tag{8.5.7}$$

where

$$\mathbf{V} = [\mathbf{v}(f_1) \;\; \mathbf{v}(f_2) \;\; \cdots \;\; \mathbf{v}(f_P)] \tag{8.5.8}$$

is an M × P matrix whose columns are the time-window frequency vectors from (8.5.6) at frequencies $f_p$ of the
complex exponentials and

$$\mathbf{A} = \begin{bmatrix}
|\alpha_1|^2 & 0 & \cdots & 0 \\
0 & |\alpha_2|^2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & |\alpha_P|^2
\end{bmatrix} \tag{8.5.9}$$

is a diagonal matrix of the powers of each of the respective complex exponentials. The autocorrelation matrix of the
white noise is

$$\mathbf{R}_w = \sigma_w^2 \mathbf{I} \tag{8.5.10}$$

which is full rank, as opposed to $\mathbf{R}_s$, which is rank-deficient for P < M. In general, we will always choose the
length of our time window M to be greater than the number of complex exponentials P.
The autocorrelation matrix can also be written in terms of its eigendecomposition

$$\mathbf{R}_x = \sum_{m=1}^{M} \lambda_m \mathbf{q}_m \mathbf{q}_m^H = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H \tag{8.5.11}$$

where $\lambda_m$ are the eigenvalues in descending order, that is, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$, and $\mathbf{q}_m$ are their
corresponding eigenvectors. Here $\mathbf{\Lambda}$ is a diagonal matrix made up of the eigenvalues found in descending order on
the diagonal, while the columns of $\mathbf{Q}$ are the corresponding eigenvectors. The eigenvalues due to the signals can be
written as the sum of the signal power in the time window and the noise:

$$\lambda_m = M|\alpha_m|^2 + \sigma_w^2 \qquad \text{for } m \le P \tag{8.5.12}$$

The remaining eigenvalues are due to the noise only, that is,

$$\lambda_m = \sigma_w^2 \qquad \text{for } m > P \tag{8.5.13}$$


Therefore, the P largest eigenvalues correspond to the signal made up of complex exponentials and the remaining
eigenvalues have equal value and correspond to the noise. Thus, we can partition the correlation matrix into portions
due to the signal and noise eigenvectors

$$\mathbf{R}_x = \sum_{m=1}^{P} (M|\alpha_m|^2 + \sigma_w^2)\, \mathbf{q}_m \mathbf{q}_m^H + \sum_{m=P+1}^{M} \sigma_w^2\, \mathbf{q}_m \mathbf{q}_m^H
= \mathbf{Q}_s \mathbf{\Lambda}_s \mathbf{Q}_s^H + \sigma_w^2 \mathbf{Q}_w \mathbf{Q}_w^H \tag{8.5.14}$$

where

$$\mathbf{Q}_s = [\mathbf{q}_1 \;\; \mathbf{q}_2 \;\; \cdots \;\; \mathbf{q}_P] \qquad \mathbf{Q}_w = [\mathbf{q}_{P+1} \;\; \cdots \;\; \mathbf{q}_M] \tag{8.5.15}$$

are matrices whose columns consist of the signal and noise eigenvectors, respectively. The matrix $\mathbf{\Lambda}_s$ is a P × P
diagonal matrix containing the signal eigenvalues from (8.5.12). Thus, the M-dimensional space that contains
the observations of the time-window signal vector from (8.5.5) can be split into two subspaces spanned by the signal
and noise eigenvectors, respectively. These two subspaces, known as the signal subspace and the noise subspace, are
orthogonal to each other since the correlation matrix is Hermitian symmetric.† All the subspace methods discussed
later in this section rely on the partitioning of the vector space into signal and noise subspaces. Recall from (7.2.29)
in Chapter 7 that the projection matrix from an M-dimensional space onto an L-dimensional subspace (L < M)
spanned by a set of vectors $\mathbf{Z} = [\mathbf{z}_1 \;\; \mathbf{z}_2 \;\; \cdots \;\; \mathbf{z}_L]$ is

$$\mathbf{P} = \mathbf{Z}(\mathbf{Z}^H\mathbf{Z})^{-1}\mathbf{Z}^H \tag{8.5.16}$$

Therefore, we can write the matrices that project an arbitrary vector onto the signal and noise subspaces as

$$\mathbf{P}_s = \mathbf{Q}_s\mathbf{Q}_s^H \qquad \mathbf{P}_w = \mathbf{Q}_w\mathbf{Q}_w^H \tag{8.5.17}$$

since the eigenvectors of the correlation matrix are orthonormal ($\mathbf{Q}_s^H\mathbf{Q}_s = \mathbf{I}$ and $\mathbf{Q}_w^H\mathbf{Q}_w = \mathbf{I}$). Since the two
subspaces are orthogonal,

$$\mathbf{P}_w\mathbf{Q}_s = \mathbf{0} \qquad \mathbf{P}_s\mathbf{Q}_w = \mathbf{0} \tag{8.5.18}$$

and all the time-window frequency vectors from (8.5.5) must lie completely in the signal subspace, that is,

$$\mathbf{P}_s\mathbf{v}(f_p) = \mathbf{v}(f_p) \qquad \mathbf{P}_w\mathbf{v}(f_p) = \mathbf{0} \tag{8.5.19}$$

These concepts are central to the subspace-based frequency estimation methods discussed in Sections 8.5.2 through
8.5.5.
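The subspace relations in (8.5.7) through (8.5.19) are easy to check numerically. The sketch below builds the true correlation matrix for a hypothetical case of two unit-power complex exponentials (M = 8, f = 0.1 and 0.2, unit noise power, all assumed values) and verifies that the noise eigenvalues equal $\sigma_w^2$ and that the frequency vectors are annihilated by the noise subspace projection.

```matlab
% Numerical check of (8.5.7)-(8.5.19) for an assumed two-exponential example.
M = 8;  f = [0.1 0.2];  pw = [1 1];  sigw2 = 1;  P = numel(f);
V  = exp(1j*2*pi*(0:M-1)'*f);               % columns are v(f_p), see (8.5.6)
Rx = V*diag(pw)*V' + sigw2*eye(M);          % (8.5.7)
Rx = (Rx + Rx')/2;                          % remove rounding asymmetry

[Q0, D]       = eig(Rx);
[lambda, idx] = sort(real(diag(D)), 'descend');
Q  = Q0(:, idx);
Qs = Q(:, 1:P);    Qw = Q(:, P+1:M);        % (8.5.15)
Ps = Qs*Qs';       Pw = Qw*Qw';             % (8.5.17)

noise_eigs = lambda(P+1:M);                 % all equal to sigw2, (8.5.13)
leak       = norm(Pw*V);                    % ~0, second relation in (8.5.19)
fit        = norm(Ps*V - V);                % ~0, first relation in (8.5.19)
signal_sum = sum(lambda(1:P));              % equals M*sum(pw) + P*sigw2
```

For these two frequencies the time-window frequency vectors are not mutually orthogonal, so the two signal eigenvalues are not individually equal to $M|\alpha_p|^2 + \sigma_w^2 = 9$; only their sum, 18, is fixed by the trace of $\mathbf{R}_x$. The noise eigenvalues and the projection relations, which are what the subspace methods rely on, hold to machine precision.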

Note that in our analysis, we are considering the theoretical or true correlation matrix $\mathbf{R}_x$. In practice, the
correlation matrix is not known and must be estimated from the measured data samples. If we have a time-window
signal vector from (8.5.4), then we can form the data matrix by stacking, as rows, the time-window data vectors
at successive times n

$$\mathbf{X} = \begin{bmatrix}
\mathbf{x}^T(0) \\ \mathbf{x}^T(1) \\ \vdots \\ \mathbf{x}^T(n) \\ \vdots \\ \mathbf{x}^T(N-2) \\ \mathbf{x}^T(N-1)
\end{bmatrix}
= \begin{bmatrix}
x(0) & x(1) & \cdots & x(M-1) \\
x(1) & x(2) & \cdots & x(M) \\
\vdots & \vdots & & \vdots \\
x(n) & x(n+1) & \cdots & x(n+M-1) \\
\vdots & \vdots & & \vdots \\
x(N-2) & x(N-1) & \cdots & x(N+M-3) \\
x(N-1) & x(N) & \cdots & x(N+M-2)
\end{bmatrix} \tag{8.5.20}$$

which has dimensions of N × M, where N is the number of data records or snapshots and M is the time-window
length. From this matrix, we can form an estimate of the correlation matrix, referred to as the sample correlation
matrix

$$\hat{\mathbf{R}}_x = \frac{1}{N}\mathbf{X}^H\mathbf{X} \tag{8.5.21}$$

In the case of an estimated sample correlation matrix, the noise eigenvalues are no longer equal because of the finite
number of samples used to compute $\hat{\mathbf{R}}_x$. Therefore, the nice, clean threshold between signal and noise eigenvalues, as
described in (8.5.12) and (8.5.13), no longer exists. The model order estimation techniques discussed in Section 8.2
can be employed to attempt to determine the number of complex exponentials P present. In practice, these methods
are best used as rough estimates, as their performance is not very accurate, especially for short data records.

For several of the frequency estimation techniques described in this section, the analysis considers the use of
eigenvalues and eigenvectors of the correlation matrix for the purposes of defining signal and noise subspaces.‡ In
practice, we estimate the signal and noise subspaces by using the eigenvectors and eigenvalues of the sample
correlation matrix. Note that for notational expedience we will not differentiate between eigenvectors and eigenvalues
of the true and sample correlation matrices. However, the reader should always keep in mind that the sample
correlation matrix eigendecomposition is what must be used for implementation. We note that use of an estimate
rather than the true correlation matrix will result in a degradation in performance, the analysis of which is beyond the
scope of this book.

† The eigenvectors of a Hermitian symmetric matrix are orthogonal.
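As a concrete illustration of (8.5.20) and (8.5.21), the following sketch forms the data matrix and the sample correlation matrix for a sinusoid in complex white noise. The signal parameters (f = 0.2, M = 3, N = 100) and the variable names are assumptions chosen for the sketch; N snapshots of length M require N + M − 1 signal samples.

```matlab
% Data matrix (8.5.20) and sample correlation matrix (8.5.21): a sketch.
M = 3;  N = 100;  f = 0.2;                     % assumed values
L = N + M - 1;                                 % samples needed for N snapshots
x = sin(2*pi*f*(0:L-1)') + (randn(L,1) + 1j*randn(L,1))/sqrt(2);

X = zeros(N, M);
for n = 1:N
    X(n, :) = x(n:n+M-1).';                    % row n holds x^T(n-1), no conjugation
end
R = (X'*X)/N;                                  % R_hat = (1/N) X^H X

lambda = sort(real(eig((R + R')/2)), 'descend');   % sample eigenvalues
```

With a finite N, the entries of lambda beyond the first P are close to, but no longer exactly equal to, the noise power, which is the effect described above.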


8.5.2 Pisarenko Harmonic Decomposition 


The Pisarenko harmonic decomposition (PHD) was the first frequency estimation method proposed that was based 
on the eigendecomposition of the correlation matrix and its partitioning into signal and noise subspaces (Pisarenko 
1973). This method uses the eigenvector associated with the smallest eigenvalue to estimate the frequencies of the 
complex exponentials. Although this method has limited practical use owing to its sensitivity to noise, it is of great 
theoretical interest because it was the first method based on signal and noise subspace principles and it helped to fuel 
the development of many well-known subspace methods, such as MUSIC and ESPRIT. 

Consider the model of complex exponentials contained in noise in (8.5.5) and the eigendecomposition of its
correlation matrix in (8.5.14). The eigenvector corresponding to the minimum eigenvalue must be orthogonal to all
the eigenvectors in the signal subspace. Thus, we choose the time window to be of length

$$M = P + 1 \tag{8.5.22}$$

that is, 1 greater than the number of complex exponentials. Therefore, the noise subspace consists of a single
eigenvector

$$\mathbf{Q}_w = \mathbf{q}_M \tag{8.5.23}$$

corresponding to the minimum eigenvalue $\lambda_M$. By virtue of the orthogonality between the signal and noise
subspaces, each of the P complex exponentials in the time-window signal vector model in (8.5.5) is orthogonal to this
eigenvector

$$\mathbf{v}^H(f_p)\mathbf{q}_M = \sum_{k=1}^{M} q_M(k) e^{-j2\pi f_p (k-1)} = 0 \qquad \text{for } 1 \le p \le P \tag{8.5.24}$$

Making use of this property, we can compute

$$\bar{R}_{phd}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\mathbf{q}_M|^2} = \frac{1}{|Q_M(e^{j2\pi f})|^2} \tag{8.5.25}$$

which is commonly referred to as a pseudospectrum. The frequencies are then estimated by observing the P peaks
in $\bar{R}_{phd}(e^{j2\pi f})$. Note that since (8.5.25) requires a search of all frequencies $-0.5 \le f \le 0.5$, in practice a dense
sampling of the frequencies is generally necessary. The quantity

$$Q_M(e^{j2\pi f}) = \mathbf{v}^H(f)\mathbf{q}_M = \sum_{k=1}^{M} q_M(k) e^{-j2\pi f (k-1)} \tag{8.5.26}$$

is simply the Fourier transform of the Mth eigenvector corresponding to the minimum eigenvalue.
Thus, the pseudospectrum for the Pisarenko harmonic decomposition $\bar{R}_{phd}(e^{j2\pi f})$ can be efficiently implemented by
computing the FFT of $\mathbf{q}_M$ with sufficient zero padding to provide the necessary frequency resolution. Then
$\bar{R}_{phd}(e^{j2\pi f})$ is simply the reciprocal of the spectrum of the noise eigenvector, that is, the reciprocal of the squared
magnitude of its Fourier transform. Note that $\bar{R}_{phd}(e^{j2\pi f})$ is not an estimate of the true power spectrum since it contains no
information about the powers of the complex exponentials $|\alpha_p|^2$ or the background noise level $\sigma_w^2$. However,
these amplitudes can be found by using the estimated frequencies and the corresponding time-window frequency
vectors along with the relationship of eigenvalues and eigenvectors. See Problem 8.24 for details.

Alternately, the frequencies of the complex exponentials can be found by computing the zeros of the Fourier
transform of the Mth eigenvector in (8.5.23). The z-transform of this eigenvector is

$$Q_M(z) = \sum_{k=1}^{M} q_M(k) z^{-(k-1)} = q_M(1) \prod_{k=1}^{M-1} (1 - e^{j2\pi f_k} z^{-1}) \tag{8.5.27}$$

where the phases of the P = M − 1 roots of this polynomial, divided by $2\pi$, are the frequencies $f_k$ of the P = M − 1 complex
exponentials.

‡ The ESPRIT method uses a singular value decomposition of data matrix X.

As we stated at the outset, the significance of the Pisarenko harmonic decomposition is seen mostly from a
theoretical perspective. The limitations of its practical use stem from the fact that it uses a single noise eigenvector
and, as a result, lacks the robustness needed for most applications. Since the correlation matrix is not
known and must be estimated from data, the resulting noise eigenvector of the estimated correlation matrix is only an
estimate of the actual noise eigenvector. Because we only use one noise eigenvector, this method is very sensitive to
any errors in the estimation of the noise eigenvector.
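The rooting form (8.5.27) can be tried out directly. The sketch below is an idealized illustration: it uses the true correlation matrix of a unit-amplitude real sinusoid at f = 0.2 in unit-power white noise (so each of the two complex exponentials has power 1/4), which makes the result exact. With a sample correlation matrix the roots would move off the unit circle and the estimates would be perturbed. All variable names are assumptions.

```matlab
% Pisarenko frequency estimation by rooting the noise eigenvector (8.5.27).
% Assumption: true correlation matrix, real sinusoid at f = 0.2, sigw2 = 1.
f     = [0.2 -0.2];                       % a sinusoid is a conjugate pair
P     = numel(f);   M = P + 1;            % (8.5.22)
sigw2 = 1;
V     = exp(1j*2*pi*(0:M-1)'*f);          % time-window frequency vectors
Rx    = 0.25*(V*V') + sigw2*eye(M);       % |alpha_p|^2 = 1/4 for each exponential
Rx    = (Rx + Rx')/2;                     % remove rounding asymmetry

[Q0, D]    = eig(Rx);
[lam, idx] = sort(real(diag(D)), 'descend');
qM         = Q0(:, idx(M));               % eigenvector of the minimum eigenvalue

z    = roots(qM);                         % the M-1 zeros of Q_M(z)
fhat = sort(angle(z)/(2*pi));             % root phases divided by 2*pi
```

Because roots treats qM as polynomial coefficients in descending powers of z, its zeros coincide with those of $Q_M(z)$, and fhat contains the two frequencies −0.2 and 0.2.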


EXAMPLE 8.5.1. We demonstrate the use of the Pisarenko harmonic decomposition with a sinusoid in noise. The amplitude and
frequency of the sinusoid are $\alpha = 1$ and f = 0.2, respectively. The additive noise has unit power ($\sigma_w^2 = 1$). Using MATLAB,
this signal is generated:

x=sin(2*pi*f*[0:N-1]')+(randn(N,1)+j*randn(N,1))/sqrt(2);

Since the number of complex exponentials is equal to P = 2 (a complex conjugate pair for a sinusoid), the time-window
length is chosen to be M = 3. After forming the N × M data matrix X and computing the
sample correlation matrix $\hat{\mathbf{R}}_x$, we can compute the pseudospectrum as follows:

[Q0,D]=eig(R); % eigendecomposition
[lambda,index]=sort(abs(diag(D))); % order by eigenvalue magnitude
lambda=lambda(M:-1:1); Q=Q0(:,index(M:-1:1));
Rbar=1./abs(fftshift(fft(Q(:,M),Nfft))).^2;

Figure 8.19 shows the pseudospectrum of the Pisarenko harmonic decomposition for a single realization with an FFT size of
1024. Note the two peaks near f = ±0.2. Recall that this is a pseudospectrum, so that the actual values do not correspond to an
estimate of power. A MATLAB routine for estimating frequencies using the Pisarenko harmonic decomposition is also provided.

FIGURE 8.19
Pseudospectrum for the Pisarenko harmonic decomposition of a sinusoid in noise with frequency f = 0.2.


8.5.3 MUSIC Algorithm 


The multiple signal classification (MUSIC) frequency estimation method was proposed as an improvement on
the Pisarenko harmonic decomposition (Bienvenu and Kopp 1983; Schmidt 1986). As in the Pisarenko harmonic
decomposition, the M-dimensional space is split into signal and noise components using the eigenvectors of the
correlation matrix from (8.5.15). However, rather than limit the length of the time window to M = P + 1, that is, 1
greater than the number of complex exponentials, we allow the size of the time window to be M > P + 1. Therefore, the
noise subspace has a dimension greater than 1. Using this larger dimension allows for averaging over the noise
subspace, providing an improved, more robust frequency estimation method than Pisarenko harmonic decomposition.

Because of the orthogonality between the noise and signal subspaces, all the time-window frequency vectors of
the complex exponentials are orthogonal to the noise subspace from (8.5.19). Thus, for each noise eigenvector
($P < m \le M$)

$$\mathbf{v}^H(f_p)\mathbf{q}_m = \sum_{k=1}^{M} q_m(k) e^{-j2\pi f_p (k-1)} = 0 \tag{8.5.28}$$

for all the P frequencies $f_p$ of the complex exponentials. Therefore, if we compute a pseudospectrum for each noise
eigenvector as

$$\bar{R}_m(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\mathbf{q}_m|^2} = \frac{1}{|Q_m(e^{j2\pi f})|^2} \tag{8.5.29}$$

the polynomial $Q_m(e^{j2\pi f})$ has M − 1 roots, P of which correspond to the frequencies of the complex exponentials.
These roots produce P peaks in the pseudospectrum from (8.5.29). Note that the pseudospectra of all M − P noise
eigenvectors share these roots that are due to the signal subspace. The remaining roots of the noise eigenvectors,
however, occur at different frequencies. There are no constraints on the location of these roots, so that some may be
close to the unit circle and produce extra peaks in the pseudospectrum. A means of reducing the levels of these
spurious peaks in the pseudospectrum is to average the M − P pseudospectra of the individual noise eigenvectors

$$\bar{R}_{music}(e^{j2\pi f}) = \frac{1}{\sum_{m=P+1}^{M} |\mathbf{v}^H(f)\mathbf{q}_m|^2} = \frac{1}{\sum_{m=P+1}^{M} |Q_m(e^{j2\pi f})|^2} \tag{8.5.30}$$

which is known as the MUSIC pseudospectrum. The frequency estimates of the P complex exponentials are then
taken as the P peaks in this pseudospectrum. Again, the term pseudospectrum is used because the quantity in (8.5.30)
does not contain information about the powers of the complex exponentials or the background noise level. Note that
for M = P + 1, the MUSIC method is equivalent to Pisarenko harmonic decomposition.

The implicit assumption in the MUSIC pseudospectrum is that the noise eigenvalues all have equal power
$\lambda_m = \sigma_w^2$, that is, the noise is white. However, in practice, when an estimate is used in place of the actual correlation
matrix, the noise eigenvalues will not be equal. The differences become more pronounced when the correlation
matrix is estimated from a small number of data samples. Thus, a slight variation on the MUSIC algorithm, known as
the eigenvector (ev) method, was proposed to account for the potentially different noise eigenvalues (Johnson and
DeGraaf 1982). For this method, the pseudospectrum is

$$\bar{R}_{ev}(e^{j2\pi f}) = \frac{1}{\sum_{m=P+1}^{M} \frac{1}{\lambda_m} |\mathbf{v}^H(f)\mathbf{q}_m|^2} = \frac{1}{\sum_{m=P+1}^{M} \frac{1}{\lambda_m} |Q_m(e^{j2\pi f})|^2} \tag{8.5.31}$$

where $\lambda_m$ is the eigenvalue corresponding to the eigenvector $\mathbf{q}_m$. The pseudospectrum of each eigenvector is
normalized by its corresponding eigenvalue. In the case of equal noise eigenvalues ($\lambda_m = \sigma_w^2$) for
$P+1 \le m \le M$, the eigenvector and MUSIC methods are identical up to a constant scale factor.
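The relationship between the MUSIC and eigenvector pseudospectra can be illustrated with a short sketch. It assumes the true correlation matrix of two unit-power complex exponentials at f = 0.1 and 0.2 with M = 8 and noise power 0.5 (all hypothetical values), so the noise eigenvalues are exactly equal and the two denominators differ only by the factor $\sigma_w^2$. With a sample correlation matrix the two curves would differ in shape as well.

```matlab
% MUSIC (8.5.30) and eigenvector-method (8.5.31) pseudospectra: a sketch.
% Assumptions: true correlation matrix, M = 8, f = 0.1 and 0.2, sigw2 = 0.5.
M = 8;  f = [0.1 0.2];  P = numel(f);  sigw2 = 0.5;  Nfft = 1024;
V  = exp(1j*2*pi*(0:M-1)'*f);
Rx = V*V' + sigw2*eye(M);                 % unit-power exponentials
Rx = (Rx + Rx')/2;
[Q0, D]       = eig(Rx);
[lambda, idx] = sort(real(diag(D)), 'descend');
Q = Q0(:, idx);

Dmusic = zeros(Nfft, 1);  Dev = zeros(Nfft, 1);
for m = P+1:M
    Qm     = abs(fftshift(fft(Q(:, m), Nfft))).^2;   % |Q_m(e^{j 2 pi f})|^2
    Dmusic = Dmusic + Qm;                            % denominator of (8.5.30)
    Dev    = Dev + Qm / lambda(m);                   % denominator of (8.5.31)
end
Rmusic = 1 ./ Dmusic;
Rev    = 1 ./ Dev;

fgrid      = (-Nfft/2 : Nfft/2 - 1)' / Nfft;         % frequency axis after fftshift
[pk, imax] = max(Rmusic);                            % strongest peak
fpeak      = fgrid(imax);
```

The strongest peak of Rmusic falls on the FFT bin closest to one of the two true frequencies, and Rev equals Rmusic multiplied by sigw2 in this idealized case.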


The peaks in the MUSIC pseudospectrum correspond to the frequencies at which the denominator in (8.5.30),
$\sum_{m=P+1}^{M} |Q_m(e^{j2\pi f})|^2$, approaches zero. Therefore, we might want to consider the z-transform of this denominator

$$P_{music}(z) = \sum_{m=P+1}^{M} Q_m(z)\, Q_m^*\!\left(\frac{1}{z^*}\right) \tag{8.5.32}$$

which is the sum of the z-transform counterparts of the pseudospectrum denominators due to each noise eigenvector. This
polynomial, which has 2M − 1 coefficients, has M − 1 pairs of roots with one inside and one outside the unit circle. Since we assume that the
complex exponentials are not damped, their corresponding roots must lie on the unit circle. Thus, if we have found
the M − 1 roots of (8.5.32) that lie inside the unit circle, the P closest roots to the unit circle will correspond to the complex exponentials. The
phases of these roots are then the frequency estimates. This method of rooting the polynomial corresponding to the
MUSIC pseudospectrum is known as root-MUSIC (Barabell 1983). Note that in many cases, a rooting method is
more efficient than computing a pseudospectrum at a very fine frequency resolution that may require a very large
FFT. Statistical performance analyses of the MUSIC algorithm can be found in Kaveh and Barabell (1986) and Stoica
and Nehorai (1989). For the performance of the root-MUSIC method see Rao and Hari (1989). A routine for the
MUSIC algorithm is provided in music.m and a routine for the root-MUSIC algorithm is provided in
rootmusic.m.
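To make the rooting idea in (8.5.32) concrete, the sketch below forms the polynomial coefficients by convolving each noise eigenvector with its conjugated, reversed copy and then roots the sum. It is not the rootmusic.m routine mentioned above; it assumes the true correlation matrix of two unit-power exponentials at f = 0.1 and 0.2 (M = 8, unit noise power), and the way the roots are selected is a choice made for this idealized case: with an exact correlation matrix each signal root is a double root on the unit circle, so the 2P roots nearest the circle are kept and each pair is merged. With estimated correlation matrices one would normally keep the roots inside the unit circle, as described in the text.

```matlab
% Root-MUSIC sketch based on (8.5.32).
% Assumptions: true correlation matrix, M = 8, f = 0.1 and 0.2, sigw2 = 1.
M = 8;  f = [0.1 0.2];  P = numel(f);
V  = exp(1j*2*pi*(0:M-1)'*f);
Rx = V*V' + eye(M);
Rx = (Rx + Rx')/2;
[Q0, D]    = eig(Rx);
[lam, idx] = sort(real(diag(D)), 'descend');
Q = Q0(:, idx);

c = zeros(2*M-1, 1);                      % coefficients of z^(M-1) * P_music(z)
for m = P+1:M
    cm = conv(Q(:, m), conj(flipud(Q(:, m))));
    c  = c + cm(:);
end
z = roots(c);                             % 2M-2 roots in reciprocal pairs

% Keep the 2P roots nearest the unit circle and merge each (near) double root.
[dist, order] = sort(abs(abs(z) - 1));
ang  = sort(angle(z(order(1:2*P))) / (2*pi));
fhat = mean(reshape(ang, 2, P), 1)';      % frequency estimates
```

The pairing by sorted phase is adequate here because the two frequencies are well separated and far from f = ±0.5; it is a simplification, not a general-purpose selection rule.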


EXAMPLE 8.5.2. In this example, we demonstrate the use of the MUSIC algorithm and examine its performance in terms of
resolution with respect to that of the minimum-variance spectral estimator. Consider the following scenario: two complex
exponentials in unit-power noise ($\sigma_w^2 = 1$) with normalized frequencies f = 0.1, 0.2, both with amplitudes of $\alpha = 1$. We
generate N = 128 samples of the signal and use a frequency vector of length M = 8. Proceeding as we did in Example 8.5.1, we
compute the eigendecomposition and partition it into signal and noise subspaces. The MUSIC pseudospectrum is computed as

Qbar=zeros(Nfft,1);
for n=1:(M-P)
    Qbar=Qbar+abs(fftshift(fft(Q(:,M-(n-1)),Nfft))).^2;
end
Rbar=1./Qbar;

The minimum-variance spectral estimate and the MUSIC pseudospectrum are computed and averaged over 1000 realizations
using an FFT size of 1024. The result is shown in Figure 8.20. The two exponentials have been clearly resolved using the MUSIC
algorithm, whereas they are not very clear using the minimum-variance spectral estimate. Since the minimum-variance spectral
estimator is nonparametric and makes no assumptions about the underlying model, it cannot achieve the resolution of the MUSIC
algorithm.


8.5.4 Minimum-Norm Method 


The minimum-norm method (Kumaresan and Tufts 1983), like the MUSIC algorithm, uses a time-window vector of
length M > P + 1 for the purposes of frequency estimation. For MUSIC, a larger time window is used than for
Pisarenko harmonic decomposition, resulting in a larger noise subspace. The use of a larger subspace provides the
necessary robustness for frequency estimation when an estimated correlation matrix is used. The same principle is
applied in the minimum-norm frequency estimation method. However, rather than average the pseudospectra of all
the noise subspace eigenvectors to reduce spurious peaks, as in the case of the MUSIC algorithm, a different
approach is taken.

FIGURE 8.20
Comparison of the minimum-variance spectral estimate (dashed line) and the MUSIC pseudospectrum (solid line) for two complex
exponentials in noise.

Consider a single vector $\mathbf{u}$ contained in the noise subspace. The pseudospectrum of this vector is given by

$$\bar{R}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\mathbf{u}|^2} \tag{8.5.33}$$

Since the vector $\mathbf{u}$ lies in the noise subspace, its pseudospectrum in (8.5.33) has P peaks corresponding to the
complex exponentials in the signal subspace. However, $\mathbf{u}$ is length M so that its pseudospectrum may exhibit an
additional M − P − 1 peaks that do not correspond to the frequencies of the complex exponentials. These spurious
peaks lead to frequency estimation errors. In the case of Pisarenko harmonic decomposition, spurious peaks were not
a concern since M = P + 1 and therefore its pseudospectrum in (8.5.25) only had P peaks. On the other hand, the
MUSIC algorithm diluted the strength of these spurious peaks since its pseudospectrum in (8.5.30) is produced by
averaging the pseudospectra of the M − P noise eigenvectors.

Recall that the projection onto the noise subspace from (8.5.17) is

$$\mathbf{P}_w = \mathbf{Q}_w\mathbf{Q}_w^H \tag{8.5.34}$$

where $\mathbf{Q}_w$ is the matrix of noise eigenvectors. Therefore, for any vector $\mathbf{u}$ that lies in the noise subspace

$$\mathbf{P}_w\mathbf{u} = \mathbf{u} \qquad \mathbf{P}_s\mathbf{u} = \mathbf{0} \tag{8.5.35}$$

where $\mathbf{P}_s$ is the signal subspace projection matrix and $\mathbf{0}$ is the length-M zero vector. Now let us consider the
z-transform of the coefficients of $\mathbf{u} = [u(1) \;\; u(2) \;\; \cdots \;\; u(M)]^T$

$$U(z) = \sum_{k=0}^{M-1} u(k+1) z^{-k} = \prod_{k=1}^{P} (1 - e^{j2\pi f_k} z^{-1}) \prod_{k=P+1}^{M-1} (1 - z_k z^{-1}) \tag{8.5.36}$$

This polynomial is the product of the P roots corresponding to complex exponentials that lie on the unit circle and
the M − P − 1 roots that in general do not lie directly on the unit circle but can potentially produce spurious peaks in
the pseudospectrum of $\mathbf{u}$. Therefore, we want to choose $\mathbf{u}$ so that it minimizes the spurious peaks due to these other
roots of its associated polynomial U(z).

The minimum-norm method, as its name implies, seeks to minimize the norm of u in order to avoid spurious 
peaks in its pseudospectrum. Using (8.5.35), the norm of a vector u contained in the noise subspace is 


$$\|\mathbf{u}\|^2 = \mathbf{u}^H \mathbf{u} = \mathbf{u}^H \mathbf{P}_w \mathbf{u} \qquad (8.5.37)$$


However, an unconstrained minimization of this norm will produce the zero vector. Therefore, we place the
constraint that the first element of u must equal 1.⁵ This constraint can be expressed as

$$\boldsymbol{\delta}_1^H \mathbf{u} = 1 \qquad (8.5.38)$$

where $\boldsymbol{\delta}_1 = [1\ 0\ \cdots\ 0]^T$. Then the determination of the minimum-norm vector comes down to solving the following
constrained minimization problem:

$$\min \|\mathbf{u}\|^2 = \mathbf{u}^H \mathbf{P}_w \mathbf{u} \quad \text{subject to} \quad \boldsymbol{\delta}_1^H \mathbf{u} = 1 \qquad (8.5.39)$$


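The solution can be found by using Lagrange multipliers (see Appendix B) and is given by

$$\mathbf{u}_{mn} = \frac{\mathbf{P}_w \boldsymbol{\delta}_1}{\boldsymbol{\delta}_1^H \mathbf{P}_w \boldsymbol{\delta}_1} \qquad (8.5.40)$$

The frequency estimates are then obtained from the peaks in the pseudospectrum of the minimum-norm (mn) vector $\mathbf{u}_{mn}$,

$$\bar{R}_{mn}(e^{j2\pi f}) = \frac{1}{|\mathbf{v}^H(f)\,\mathbf{u}_{mn}|^2} \qquad (8.5.41)$$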
The performance of the minimum-norm frequency estimation method is similar to that of MUSIC. For a
performance comparison see Kaveh and Barabell (1986). Note that it is also possible to implement the
minimum-norm method by rooting a polynomial rather than computing a pseudospectrum (see Problem 8.25).

⁵ The choice of a value of 1 is somewhat arbitrary, since any nonzero constant will result in a similar solution.
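
For readers who want the intermediate steps behind (8.5.40), the Lagrange-multiplier argument can be sketched as follows. The parametrization $\mathbf{u} = \mathbf{Q}_w\mathbf{c}$, the coefficient vector $\mathbf{c}$ of length M − P, and the multiplier $\lambda$ are assumptions introduced here to keep u inside the noise subspace; they are not notation taken from the text. The sketch uses $\mathbf{Q}_w^H\mathbf{Q}_w = \mathbf{I}$ and $\mathbf{P}_w = \mathbf{Q}_w\mathbf{Q}_w^H$ from (8.5.34).

```latex
\begin{aligned}
&\mathbf{u} = \mathbf{Q}_w \mathbf{c}
  \;\Rightarrow\; \|\mathbf{u}\|^2 = \mathbf{c}^H \mathbf{c},
  \qquad \boldsymbol{\delta}_1^H \mathbf{Q}_w \mathbf{c} = 1 \\
&\mathcal{L}(\mathbf{c},\lambda) = \mathbf{c}^H\mathbf{c}
  + \lambda\,(1 - \boldsymbol{\delta}_1^H \mathbf{Q}_w \mathbf{c})
  + \lambda^{*}(1 - \mathbf{c}^H \mathbf{Q}_w^H \boldsymbol{\delta}_1) \\
&\nabla_{\mathbf{c}^{*}} \mathcal{L}
  = \mathbf{c} - \lambda^{*}\mathbf{Q}_w^H\boldsymbol{\delta}_1 = \mathbf{0}
  \;\Rightarrow\; \mathbf{c} = \lambda^{*}\mathbf{Q}_w^H\boldsymbol{\delta}_1 \\
&\boldsymbol{\delta}_1^H\mathbf{Q}_w\mathbf{c}
  = \lambda^{*}\,\boldsymbol{\delta}_1^H\mathbf{P}_w\boldsymbol{\delta}_1 = 1
  \;\Rightarrow\; \lambda^{*} = \frac{1}{\boldsymbol{\delta}_1^H\mathbf{P}_w\boldsymbol{\delta}_1} \\
&\mathbf{u}_{mn} = \mathbf{Q}_w\mathbf{c}
  = \frac{\mathbf{P}_w\boldsymbol{\delta}_1}{\boldsymbol{\delta}_1^H\mathbf{P}_w\boldsymbol{\delta}_1},
  \qquad \|\mathbf{u}_{mn}\|^2 = \frac{1}{\boldsymbol{\delta}_1^H\mathbf{P}_w\boldsymbol{\delta}_1}
\end{aligned}
```

The last line shows that the attained norm is the reciprocal of the (1,1) element of the noise projection matrix, so the constraint excludes the zero vector without otherwise shaping the solution.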


EXAMPLE 8.5.3. In this example, we illustrate the use of the minimum-norm method and compare its performance to that of the
other three frequency estimation methods discussed in this chapter: Pisarenko harmonic decomposition, the MUSIC algorithm, and
the eigenvector method. The pseudospectrum of the minimum-norm method is found by first computing the minimum-norm vector
$\mathbf{u}_{mn}$ and then finding its pseudospectrum, that is,

delta1=zeros(M,1); delta1(1)=1;
Pn=Q(:,(P+1):M)*Q(:,(P+1):M)'; % noise subspace projection matrix
u=(Pn*delta1)/(delta1'*Pn*delta1); % minimum-norm vector
Rbar=1./abs(fftshift(fft(u,Nfft))).^2; % pseudospectrum


Consider the case of P = 4 complex exponentials in noise with frequencies f = 0.1, 0.25, 0.4, and −0.1, all with an
amplitude of α = 1. The power of the noise is set to $\sigma_w^2 = 1$ with 100 realizations. The time-window length used was M = 8
for all the methods except Pisarenko harmonic decomposition, which is constrained to use M = P + 1 = 5. The pseudospectra are
shown in Figure 8.21 with an FFT size of 1024, where we have not averaged in order to demonstrate the variance of the various
methods. Here we see the large variance in the frequency estimates that is produced by Pisarenko harmonic decomposition
compared to the other methods, which is a direct result of using a one-dimensional noise subspace. The other methods all perform
comparably in terms of estimating the frequencies of the complex exponentials. Note the fluctuations in the pseudospectrum of the
eigenvector method that result from the normalization by the eigenvalues. Since these eigenvalues vary over realizations, the
pseudospectra will also reflect a similar variation. Routines for the eigenvector method and the minimum-norm method are
provided in ev_method.m and minnorm.m, respectively.
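
The fragment in this example assumes that Q, M, P, and Nfft already exist in the workspace. A self-contained sketch for a single realization might look like the following. The data-matrix construction (each row is the Hermitian transpose of a time-window vector), the use of eig on a sample correlation matrix, and every variable name are assumptions made for illustration; this is not the contents of minnorm.m.

```matlab
% Minimum-norm pseudospectrum for P complex exponentials in white noise.
% Illustrative sketch: names and data-matrix convention are assumptions.
N = 256; M = 8; P = 4; Nfft = 1024;
f_true = [0.1 0.25 0.4 -0.1];                   % frequencies from Example 8.5.3
n = (0:N-1).';
x = sum(exp(1j*2*pi*n*f_true), 2) ...           % unit-amplitude exponentials
    + (randn(N,1) + 1j*randn(N,1))/sqrt(2);     % unit-power complex white noise

% Data matrix: row k is the Hermitian transpose of [x(k) ... x(k+M-1)].'
K = N - M + 1;
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';
end

% Sample correlation matrix (1/K)*sum of x*x^H, forced to be Hermitian.
R = (X'*X)/K;
R = (R + R')/2;

% Sort eigenvectors by decreasing eigenvalue; the last M-P span the noise subspace.
[Q, D] = eig(R);
[~, idx] = sort(real(diag(D)), 'descend');
Q = Q(:, idx);

Pn = Q(:,(P+1):M)*Q(:,(P+1):M)';                % noise subspace projection (8.5.34)
delta1 = zeros(M,1); delta1(1) = 1;
u = (Pn*delta1)/(delta1'*Pn*delta1);            % minimum-norm vector (8.5.40)
Rbar = 1./abs(fftshift(fft(u, Nfft))).^2;       % pseudospectrum (8.5.41)
f = (-Nfft/2:Nfft/2-1).'/Nfft;                  % frequency axis matching fftshift
% plot(f, 10*log10(Rbar)) displays the pseudospectrum in dB.
```

Two properties hold for any noise realization and make a quick sanity check: the first element of u equals 1, and Pn*u reproduces u because u lies in the noise subspace.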
FIGURE 8.21
Comparison of the eigendecomposition-based frequency estimation methods: (a) Pisarenko harmonic decomposition, (b) MUSIC,
(c) eigenvector method, and (d) minimum-norm method.


8.5.5 ESPRIT Algorithm


A frequency estimation technique that is built upon the same principles as other subspace methods but further exploits
a deterministic relationship between subspaces is the estimation of signal parameters via rotational invariance
techniques (ESPRIT) algorithm. This method differs from the other subspace methods discussed so far in this chapter
in that the signal subspace is estimated from the data matrix X rather than the estimated correlation matrix $\hat{\mathbf{R}}_x$. The
essence of ESPRIT lies in the rotational property between staggered subspaces that is invoked to produce the
frequency estimates. In the case of a discrete-time signal or time series, this property relies on observations of the
signal over two identical intervals staggered in time. This condition arises naturally for discrete-time signals, provided
that the sampling is performed uniformly in time.⁶ We first describe the original, least-squares version of the
algorithm (Roy et al. 1986) and then extend the derivation to total least-squares ESPRIT (Roy and Kailath 1989),
which is the preferred method for use. Since the derivation of the algorithm requires an extensive amount of
formulation and matrix manipulations, we have included a block diagram in Figure 8.22 to be used as a guide through
this process.


FIGURE 8.22
Block diagram demonstrating the flow of the ESPRIT algorithm starting from the data matrix through the frequency estimates.


⁶ This condition is violated in the case of a nonuniformly sampled time series.

Consider a single complex exponential $s(n) = \alpha e^{j2\pi fn}$ with complex amplitude α and frequency f. This signal
has the following property

$$s(n+1) = \alpha e^{j2\pi f(n+1)} = s(n)\, e^{j2\pi f} \qquad (8.5.42)$$

that is, the next sample value is a phase-shifted version of the current value. This phase shift can be represented as a
rotation on the unit circle $e^{j2\pi f}$. Recall the time-window vector model from (8.5.4) consisting of a signal s(n),
made up of complex exponentials, and the noise component w(n)

$$\mathbf{x}(n) = \sum_{p=1}^{P} \alpha_p \mathbf{v}(f_p)\, e^{j2\pi n f_p} + \mathbf{w}(n) = \mathbf{V}\boldsymbol{\Phi}^n \boldsymbol{\alpha} + \mathbf{w}(n) = \mathbf{s}(n) + \mathbf{w}(n) \qquad (8.5.43)$$
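where the P columns of matrix V are length-M time-window frequency vectors of the complex exponentials

$$\mathbf{V} = [\mathbf{v}(f_1)\ \mathbf{v}(f_2)\ \cdots\ \mathbf{v}(f_P)] \qquad (8.5.44)$$

The vector $\boldsymbol{\alpha}$ consists of the amplitudes of the complex exponentials $\alpha_p$. On the other hand,
matrix $\boldsymbol{\Phi}$ is the diagonal matrix of phase shifts between neighboring time samples of the individual, complex
exponential components of s(n)

$$\boldsymbol{\Phi} = \mathrm{diag}\{\phi_1, \phi_2, \ldots, \phi_P\} = \begin{bmatrix} e^{j2\pi f_1} & 0 & \cdots & 0 \\ 0 & e^{j2\pi f_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{j2\pi f_P} \end{bmatrix} \qquad (8.5.45)$$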


where $\phi_p = e^{j2\pi f_p}$ for p = 1, 2, …, P. Since the frequencies of the complex exponentials $f_p$
completely describe this rotation matrix, frequency estimates can be obtained by finding $\boldsymbol{\Phi}$. Let us consider two
overlapping subwindows of length M − 1 within the length-M time-window vector. This subwindowing operation is
illustrated in Figure 8.23. Consider the signal consisting of the sum of complex exponentials

$$\mathbf{s}(n) = \begin{bmatrix} \mathbf{s}_{M-1}(n) \\ s(n+M-1) \end{bmatrix} = \begin{bmatrix} s(n) \\ \mathbf{s}_{M-1}(n+1) \end{bmatrix} \qquad (8.5.46)$$

where $\mathbf{s}_{M-1}(n)$ is the length-(M − 1) subwindow of s(n), that is,

$$\mathbf{s}_{M-1}(n) = \mathbf{V}_{M-1} \boldsymbol{\Phi}^n \boldsymbol{\alpha} \qquad (8.5.47)$$

FIGURE 8.23
Time-staggered, overlapping windows used by the ESPRIT algorithm.


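Matrix $\mathbf{V}_{M-1}$ is constructed in the same manner as V except its time-window frequency vectors are of length
M − 1, denoted as $\mathbf{v}_{M-1}(f)$,

$$\mathbf{V}_{M-1} = [\mathbf{v}_{M-1}(f_1)\ \mathbf{v}_{M-1}(f_2)\ \cdots\ \mathbf{v}_{M-1}(f_P)] \qquad (8.5.48)$$

Recall that s(n) is the scalar signal made up of the sum of complex exponentials at time n. Using the relation in
(8.5.47), we can define the matrices

$$\mathbf{V}_1 = \mathbf{V}_{M-1}\boldsymbol{\Phi}^n \qquad \mathbf{V}_2 = \mathbf{V}_{M-1}\boldsymbol{\Phi}^{n+1} \qquad (8.5.49)$$

where $\mathbf{V}_1$ and $\mathbf{V}_2$ correspond to the unstaggered and staggered windows, that is,

$$\mathbf{V} = \begin{bmatrix} \mathbf{V}_1 \\ *\ *\ \cdots\ * \end{bmatrix} = \begin{bmatrix} *\ *\ \cdots\ * \\ \mathbf{V}_2 \end{bmatrix} \qquad (8.5.50)$$

Clearly, by examining (8.5.49), these two matrices of time-window frequency vectors are related as

$$\mathbf{V}_2 = \mathbf{V}_1 \boldsymbol{\Phi} \qquad (8.5.51)$$

Note that each of these two matrices spans a different, though related, (M − 1)-dimensional subspace.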

Now suppose that we have a data matrix X from (8.5.20) with N data records of the length-M time-window
vector signal x(n). Using the singular value decomposition (SVD) discussed in Chapter 7, we can write the data
matrix as

$$\mathbf{X} = \mathbf{L}\boldsymbol{\Sigma}\mathbf{U}^H \qquad (8.5.52)$$

where L is an N × N matrix of left singular vectors and U is an M × M matrix of right singular vectors. Both of
these matrices are unitary; that is, $\mathbf{L}^H\mathbf{L} = \mathbf{I}$ and $\mathbf{U}^H\mathbf{U} = \mathbf{I}$.⁷ The matrix $\boldsymbol{\Sigma}$ has dimensions N × M consisting of
singular values on the main diagonal ordered in descending magnitude. The squared magnitudes of the singular
values are equal to the eigenvalues of $\hat{\mathbf{R}}_x$ scaled by a factor of N from (8.5.21), and the columns of U are their
corresponding eigenvectors. Thus, U forms an orthonormal basis for the underlying M-dimensional vector space. This
subspace can be partitioned into signal and noise subspaces as

$$\mathbf{U} = [\mathbf{U}_s \mid \mathbf{U}_w] \qquad (8.5.53)$$

where $\mathbf{U}_s$ is the matrix of right-hand singular vectors corresponding to the singular values with the P largest
magnitudes. Note that since the signal portion consists of the sum of complex exponentials modeled as time-window
frequency vectors v(f), all these frequency vectors, for $f = f_1, f_2, \ldots, f_P$, must also lie in the signal subspace.
As a result, the matrices V and $\mathbf{U}_s$ span the same subspace. Therefore, there exists an invertible transformation T that
maps $\mathbf{U}_s$ into V, that is,

$$\mathbf{V} = \mathbf{U}_s \mathbf{T} \qquad (8.5.54)$$


The transformation T is never solved for in this derivation, but instead is only formulated as a mapping between these 
two matrices within the signal subspace. 
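
The link between the SVD in (8.5.52) and the eigendecomposition of the sample correlation matrix can be checked numerically. The sketch below is only an illustration under stated assumptions: the test sequence is arbitrary, the rows of X are taken to be Hermitian transposes of the time-window vectors, the scaling uses the number of rows K (the text's N counts data records), and all names are hypothetical.

```matlab
% Check: squared singular values of X, divided by the number of rows,
% equal the eigenvalues of X'*X/K, and the columns of U are its eigenvectors.
N = 64; M = 6;                                  % arbitrary sizes for the check
x = randn(N,1) + 1j*randn(N,1);                 % any complex test sequence
K = N - M + 1;                                  % number of time-window vectors
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';                       % row k = Hermitian transpose of window k
end
[L, S, U] = svd(X);                             % X = L*S*U'
Rhat = (X'*X)/K;                                % sample correlation matrix (1/K scaling assumed)
lam = sort(real(eig(Rhat)), 'descend');         % eigenvalues, largest first
sv2 = diag(S(1:M,1:M)).^2 / K;                  % scaled squared singular values
err_eig = max(abs(sv2 - lam));                  % expected at rounding level
err_vec = norm(Rhat*U - U*diag(sv2), 'fro');    % expected at rounding level
```

Both error measures should come out at the level of floating-point rounding, which is why the first P columns of U can stand in for the signal eigenvectors of the correlation matrix.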
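Proceeding as we did with the matrix V in (8.5.50), we can partition the signal subspace into two smaller
(M − 1)-dimensional subspaces as

$$\mathbf{U}_s = \begin{bmatrix} \mathbf{U}_1 \\ *\ *\ \cdots\ * \end{bmatrix} = \begin{bmatrix} *\ *\ \cdots\ * \\ \mathbf{U}_2 \end{bmatrix} \qquad (8.5.55)$$

where $\mathbf{U}_1$ and $\mathbf{U}_2$ correspond to the unstaggered and staggered subspaces, respectively. Since $\mathbf{V}_1$ and $\mathbf{V}_2$ correspond to
the same subspaces, the relation from (8.5.54) must also hold for these subspaces

$$\mathbf{V}_1 = \mathbf{U}_1 \mathbf{T} \qquad \mathbf{V}_2 = \mathbf{U}_2 \mathbf{T} \qquad (8.5.56)$$

The staggered and unstaggered components of the matrix V in (8.5.50) are related through the subspace rotation $\boldsymbol{\Phi}$
in (8.5.51). Since the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ also span these respective, related subspaces, a similar, though different,
rotation must exist that relates (rotates) $\mathbf{U}_1$ to $\mathbf{U}_2$

$$\mathbf{U}_2 = \mathbf{U}_1 \boldsymbol{\Psi} \qquad (8.5.57)$$

where $\boldsymbol{\Psi}$ is this rotation matrix.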
⁷ Our notation differs slightly from that introduced in Chapter 7 in order to avoid confusion with the matrix of time-window frequency
vectors V.

Recall that frequency estimation comes down to solving for the subspace rotation matrix $\boldsymbol{\Phi}$. We can estimate
$\boldsymbol{\Phi}$ by making use of the relations in (8.5.56) together with the rotations between the staggered signal subspaces in
(8.5.51) and (8.5.57). In this process, the matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ are known from the SVD on data matrix X. First, we
solve for $\boldsymbol{\Psi}$ from the relation in (8.5.57), using the method of least-squares (LS) from Chapter 7

$$\boldsymbol{\Psi} = (\mathbf{U}_1^H \mathbf{U}_1)^{-1} \mathbf{U}_1^H \mathbf{U}_2 \qquad (8.5.58)$$

Substituting (8.5.57) into (8.5.56), we have

$$\mathbf{V}_2 = \mathbf{U}_2 \mathbf{T} = \mathbf{U}_1 \boldsymbol{\Psi} \mathbf{T} \qquad (8.5.59)$$

Similarly, we can also solve for $\mathbf{V}_2$ using the relation in (8.5.51) and substituting (8.5.56) for $\mathbf{V}_1$

$$\mathbf{V}_2 = \mathbf{V}_1 \boldsymbol{\Phi} = \mathbf{U}_1 \mathbf{T} \boldsymbol{\Phi} \qquad (8.5.60)$$


Thus, equating the two right-hand sides of (8.5.59) and (8.5.60), we have the following relation between the two
subspace rotations

$$\boldsymbol{\Psi}\mathbf{T} = \mathbf{T}\boldsymbol{\Phi} \qquad (8.5.61)$$

or equivalently

$$\boldsymbol{\Psi} = \mathbf{T}\boldsymbol{\Phi}\mathbf{T}^{-1} \qquad (8.5.62)$$

Equations (8.5.61) and (8.5.62) should be recognized as the relationship between eigenvectors and eigenvalues of the
matrix $\boldsymbol{\Psi}$ (Golub and Van Loan 1996). Therefore, the diagonal elements of $\boldsymbol{\Phi}$, $\phi_p$ for p = 1, 2, …, P, are
simply the eigenvalues of $\boldsymbol{\Psi}$. As a result, the estimates of the frequencies are

$$\hat{f}_p = \frac{\angle \phi_p}{2\pi} \qquad (8.5.63)$$

where $\angle \phi_p$ is the phase of $\phi_p$. Although the principle behind the ESPRIT algorithm, namely, the
use of subspace rotations, is quite simple, one can easily get lost in the details of the derivation of the algorithm. Note
that we have only used simple matrix relationships. An illustrative example of the implementation of ESPRIT in
MATLAB is given in Example 8.5.4 to help clarify the details of the algorithm. However, first we give a total
least-squares version of the algorithm, which is the preferred method for use.
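
The chain of relations (8.5.51), (8.5.58), (8.5.61), and (8.5.63) can be verified in a noise-free setting, where every step holds exactly. In the sketch below, an orthonormal basis obtained from a QR factorization of V plays the role of the signal-subspace matrix; that substitution, the choice n = 0 in (8.5.49), and all variable names are assumptions made for illustration rather than part of the derivation in the text.

```matlab
% Noise-free check of the ESPRIT relations; all names are illustrative.
M = 8;
f_true = [0.1 0.15 0.4 -0.15];                % frequencies borrowed from Example 8.5.4
P = numel(f_true);
V = exp(1j*2*pi*(0:M-1).'*f_true);            % M x P matrix of time-window frequency vectors
Phi = diag(exp(1j*2*pi*f_true));              % diagonal matrix of phase shifts
V1 = V(1:M-1,:);  V2 = V(2:M,:);              % unstaggered and staggered rows of V
err_rot = norm(V2 - V1*Phi, 'fro');           % (8.5.51): V2 = V1*Phi

[Us, T] = qr(V, 0);                           % orthonormal basis of span(V), with V = Us*T
U1 = Us(1:M-1,:);  U2 = Us(2:M,:);            % staggered partitions as in (8.5.55)
Psi = U1 \ U2;                                % LS solution (8.5.58)
err_sim = norm(Psi*T - T*Phi, 'fro');         % (8.5.61): Psi*T = T*Phi
fhat = sort(angle(eig(Psi))/(2*pi));          % (8.5.63), sorted for comparison
err_f = max(abs(fhat - sort(f_true).'));      % deviation from the true frequencies
```

Because T is never needed to obtain fhat, the check also illustrates why the transformation is only formulated and not solved for: the eigenvalues of Psi are unaffected by the particular basis chosen for the signal subspace.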

Note that the subspaces $\mathbf{U}_1$ and $\mathbf{U}_2$ are both only estimates of the true subspaces that correspond to $\mathbf{V}_1$ and $\mathbf{V}_2$,
respectively, obtained from the data matrix X. The estimate of the subspace rotation was obtained by solving (8.5.57)
using the LS criterion

$$\boldsymbol{\Psi}_{ls} = (\mathbf{U}_1^H \mathbf{U}_1)^{-1} \mathbf{U}_1^H \mathbf{U}_2 \qquad (8.5.64)$$

This LS solution is obtained by minimizing the errors in an LS sense from the following formulation

$$\mathbf{U}_2 + \mathbf{E}_2 = \mathbf{U}_1 \boldsymbol{\Psi} \qquad (8.5.65)$$

where $\mathbf{E}_2$ is a matrix consisting of errors between $\mathbf{U}_2$ and the true subspace corresponding to $\mathbf{V}_2$. Note that this LS
formulation assumes errors only on the estimation of $\mathbf{U}_2$ and no errors between $\mathbf{U}_1$ and the true subspace that it is
attempting to estimate corresponding to $\mathbf{V}_1$. Therefore, since $\mathbf{U}_1$ is also an estimated subspace, a more appropriate
formulation is

$$\mathbf{U}_2 + \mathbf{E}_2 = (\mathbf{U}_1 + \mathbf{E}_1)\boldsymbol{\Psi} \qquad (8.5.66)$$

where $\mathbf{E}_1$ is the matrix representing the errors between $\mathbf{U}_1$ and the true subspace corresponding to $\mathbf{V}_1$. A solution to
this problem, known as total least squares (TLS), is obtained by minimizing the Frobenius norm of the two error
matrices

$$\|[\mathbf{E}_1\ \mathbf{E}_2]\|_F \qquad (8.5.67)$$

Since the principles of TLS are beyond the scope of this book, we simply give the procedure to obtain the TLS
solution of $\boldsymbol{\Psi}$ and refer the interested reader to Golub and Van Loan (1996).

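First, form a matrix made up of the staggered signal subspace matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ placed side by side⁸ and
perform an SVD

$$[\mathbf{U}_1\ \mathbf{U}_2] = \tilde{\mathbf{L}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{U}}^H \qquad (8.5.68)$$

We then operate on the 2P × 2P matrix $\tilde{\mathbf{U}}$ of right singular vectors. This matrix is partitioned into P × P
quadrants

$$\tilde{\mathbf{U}} = \begin{bmatrix} \tilde{\mathbf{U}}_{11} & \tilde{\mathbf{U}}_{12} \\ \tilde{\mathbf{U}}_{21} & \tilde{\mathbf{U}}_{22} \end{bmatrix} \qquad (8.5.69)$$

The TLS solution for the subspace rotation matrix $\boldsymbol{\Psi}$ is then

$$\boldsymbol{\Psi}_{tls} = -\tilde{\mathbf{U}}_{12}\tilde{\mathbf{U}}_{22}^{-1} \qquad (8.5.70)$$

The frequency estimates are then obtained from (8.5.62) and (8.5.63) by using $\boldsymbol{\Psi}_{tls}$ from (8.5.70). Although the
TLS version of ESPRIT involves slightly more computations, it is generally preferred over the LS version based on
the formulation in (8.5.66). A statistical analysis of the performance of the ESPRIT algorithms is given in Ottersten et al.
(1991).

⁸ Note that this matrix $[\mathbf{U}_1\ \mathbf{U}_2] \neq \mathbf{U}_s = [\mathbf{U}_1^T\ \mathbf{U}_2^T]^T$ from (8.5.55).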


EXAMPLE 8.5.4. In this illustrative example, we demonstrate the use of both the LS and TLS versions of the ESPRIT algorithm
on a set of complex exponentials in white noise using MATLAB. First, generate a signal s(n) of length N = 128 consisting of
complex exponential signals at normalized frequencies f = 0.1, 0.15, 0.4, and −0.15, all with amplitude α = 1. Each of the
complex exponentials is generated by exp(j*2*pi*f*[0:(N-1)]'). The overall signal in white noise with unit power
($\sigma_w^2 = 1$) is then

x=s+(randn(N,1)+j*randn(N,1))/sqrt(2);

We form the data matrix corresponding to (8.5.20) for a time window of length M = 8. The least-squares ESPRIT algorithm
is then performed as follows:

[L,S,U]=svd(X);
Us=U(:,1:P); % signal subspace
U1=Us(1:(M-1),:); U2=Us(2:M,:); % signal subspaces
Psi=U1\U2; % LS solution for Psi

If we are using the TLS version of ESPRIT, then solve for

[LL,SS,UU]=svd([U1 U2]);
UU12=UU(1:P,(P+1):(2*P));
UU22=UU((P+1):(2*P),(P+1):(2*P));
Psi=-UU12*inv(UU22); % TLS solution for Psi

The frequencies are found by computing the phases of the eigenvalues of $\boldsymbol{\Psi}$, that is,

phi=eig(Psi); % eigenvalues of Psi
fhat=angle(phi)/(2*pi); % frequency estimates

In both cases, we average over 1000 realizations and obtain average estimated frequencies very close to the true values f =
0.1, 0.15, 0.4, and −0.15 used to generate the signals. Routines for both the LS and TLS versions of ESPRIT are provided in
esprit_ls.m and esprit_tls.m.
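
The fragments in this example leave the construction of the data matrix X to (8.5.20). A self-contained sketch of one realization, including that step, could read as follows. It assumes that each row of X is the Hermitian transpose of a length-M time-window vector, so that the right singular vectors span the subspace containing v(f); with a different row convention the estimates may need to be conjugated. The variable names are hypothetical and the code is not the contents of esprit_ls.m or esprit_tls.m.

```matlab
% LS and TLS ESPRIT for one realization; illustrative sketch only.
N = 128; M = 8; P = 4;
f_true = [0.1 0.15 0.4 -0.15];                  % frequencies from Example 8.5.4
n = (0:N-1).';
s = sum(exp(1j*2*pi*n*f_true), 2);              % sum of unit-amplitude exponentials
x = s + (randn(N,1) + 1j*randn(N,1))/sqrt(2);   % add unit-power complex white noise

% Data matrix: row k is the Hermitian transpose of [x(k) ... x(k+M-1)].'
K = N - M + 1;
X = zeros(K, M);
for k = 1:K
    X(k,:) = x(k:k+M-1)';
end

[L, S, U] = svd(X);                             % X = L*S*U'
Us = U(:,1:P);                                  % signal subspace (P largest singular values)
U1 = Us(1:M-1,:);  U2 = Us(2:M,:);              % unstaggered and staggered partitions

% Least-squares ESPRIT, (8.5.64)
Psi_ls = U1 \ U2;
f_ls = sort(angle(eig(Psi_ls))/(2*pi));

% Total least-squares ESPRIT, (8.5.68)-(8.5.70)
[LL, SS, UU] = svd([U1 U2]);
UU12 = UU(1:P, (P+1):(2*P));
UU22 = UU((P+1):(2*P), (P+1):(2*P));
Psi_tls = -UU12 / UU22;                         % same as -UU12*inv(UU22)
f_tls = sort(angle(eig(Psi_tls))/(2*pi));
```

The right division in the TLS step avoids forming an explicit inverse; it is a numerical preference, not a change to (8.5.70). Averaging f_ls and f_tls over many realizations, as the example describes, would require wrapping the script in a loop.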


8.6 Summary 


In this chapter, we have examined the modeling process for both pole-zero and harmonic signal models. As for all 
signal modeling problems, the procedure begins with the selection of the appropriate model for the signal under 
consideration. Then the signal model is applied by estimating the model parameters from a collection of data samples. 
However, as we have stressed throughout this chapter, nothing is more valuable in the modeling process than specific 
knowledge of the signal and its underlying process in order to assess the validity of the model for a particular signal. 
For this reason, we began the chapter with a discussion of a model building procedure, starting with the choice of the 
appropriate model and the estimation of its parameters, and concluding with the validation of the model. Clearly, if 
the model is not well-suited for the signal, the application of the model becomes meaningless. 

In the first part of the chapter, we considered the application of the parametric signal models that were discussed
in Chapter 3. The estimation of all-pole models was presented for both direct and lattice structures. Within this
context, we used various model order selection criteria to determine the order of the all-pole model. However, these
criteria are not necessarily limited to all-pole models. In addition, the relationship was given between the all-pole
model and Burg's method of maximum entropy. Next, we considered pole-zero modeling. Using a nonlinear
least-squares technique, a method was presented for estimating the parameters of the pole-zero model. The use of
pole-zero models for the purposes of spectral estimation along with their application to speech modeling was also
considered.

The latter part of the chapter focused on harmonic signal models, that is, modeling signals using the sum of
complex exponentials. The harmonic modeling problem becomes one of estimating the frequencies of the complex
exponentials. We then discussed some of the more popular harmonic modeling methods. Starting with the Pisarenko
harmonic decomposition, the first such method, we discussed the MUSIC, eigenvector, root-MUSIC, and minimum-norm
methods for frequency estimation. All of these methods are based on computing a pseudospectrum or rooting a
polynomial from an estimated correlation matrix. Finally, we gave a brief derivation of the ESPRIT algorithm, both in
its original LS form and the more commonly used TLS form.


Problems 


8.1 Consider the random process x(n) described in Example 8.2.3 that is simulated by exciting the system function

$$H(z) = \frac{1}{1 - 2.7607z^{-1} + 3.8108z^{-2} - 2.6535z^{-3} + 0.9238z^{-4}}$$

using a WGN(0, 1) process. Generate N = 250 samples of the process x(n).
(a) Write a MATLAB function that implements the modified covariance method to obtain AR(P) model coefficients and the
modeling error variance $\hat{\sigma}_w^2$ as a function of P, using N samples of x(n).
(b) Compute and plot the variance $\hat{\sigma}_w^2$, FPE(P), AIC(P), MDL(P), and CAT(P) for P = 1, 2, …, 15.
(c) Comment on your results and the usefulness of model selection criteria for the process x(n).


8.2 Consider the Burg approach of minimizing forward-backward LS error $E_m^{fb}$ in (8.2.33).
(a) Show that by using (8.2.26) and (8.2.27), $E_m^{fb}$ can be put in the form of (8.2.34).
(b) By minimizing $E_m^{fb}$ with respect to $k_{m-1}$, show that the expression for the optimum $k_{m-1}^{B}$ is given by (8.2.35).
(c) Show that $|k_{m-1}^{B}| < 1$.
(d) Show that $|k_{m-1}^{B}| < |k_{m-1}^{IS}| < 1$ where $k_{m-1}^{IS}$ is defined in (8.2.36).


8.3 Generate an AR(2) process using the system function

$$H(z) = \frac{1}{1 - 0.9z^{-1} + 0.81z^{-2}}$$

excited by a WGN(0, 1) process. Illustrate numerically that if we use the full-windowing method, that is, the matrix X in (8.2.8),
then the PACS estimates $\{k_m^{FP}\}_0^1$, $\{k_m^{BP}\}_0^1$, and $\{k_m^{B}\}_0^1$ of Section 8.2 are identical and hence can be obtained by using
the Levinson-Durbin algorithm.


8.4 Generate sample sequences of an AR(2) process
x(n) = w(n) − 1.5857x(n − 1) − 0.9604x(n − 2)
where w(n) ~ WGN(0, 1). Choose N = 256 samples for each realization.
(a) Design a first-order optimum linear predictor, and compute the prediction error e(n). Test the whiteness of the error sequence
e(n) using the autocorrelation, PSD, and partial correlation methods, discussed in Section 8.1. Show your results as an
overlay plot using 20 realizations.
(b) Repeat the above part, using second- and third-order linear predictors.
(c) Comment on your plots.


8.5 Generate sample functions of the process
x(n) = 0.5w(n) + 0.5w(n − 1)
where w(n) ~ WGN(0, 1). Choose N = 256 samples for each realization.
(a) Test the whiteness of x(n) and show your results, using overlay plots based on 10 realizations.
(b) Process x(n) through the AR(1) filter

$$H(z) = \frac{1}{1 + 0.95z^{-1}}$$

to obtain y(n). Test the whiteness of y(n) and show your results, using overlay plots based on 10 realizations.


8.6 The process x(n) contains a complex exponential in white noise, that is,
$$x(n) = A e^{j(\omega_0 n + \theta)} + w(n)$$
where A is a real positive constant, θ is a random variable uniformly distributed over [0, 2π], $\omega_0$ is a constant between 0 and
π, and $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$. The purpose of this problem is to analytically obtain a maximum entropy method (MEM)
estimate by fitting an AR(P) model and then evaluating the $\{a_k\}_1^P$ model coefficients.
(a) Show that the (P + 1) × (P + 1) autocorrelation matrix of x(n) is given by
$$\mathbf{R}_x = A^2 \mathbf{e}\mathbf{e}^H + \sigma_w^2 \mathbf{I}$$
where $\mathbf{e} = [1\ e^{-j\omega_0}\ \cdots\ e^{-jP\omega_0}]^T$.
(b) By solving autocorrelation normal equations, show that 


a, =[la, -+ ap] 


A? 
1+——— Je 
{ aed 


(c) Show that the MEM estimate based on the above coefficients is given by 





A 
eel ee a 0 | 
o r (P+A™ r 


Re”) = — ak È | 
x |1 A? W, (eX?) ? 


o2+(P+1)A 


where W,(e'”) is the DTFT of the (P+1) length rectangular window. 


8.7 An AR(2) process y(n) is observed in noise v(n) to obtain x(n), that is,
$$x(n) = y(n) + v(n) \qquad v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$$
where v(n) is uncorrelated with y(n) and
$$y(n) = 1.27y(n-1) - 0.81y(n-2) + w(n) \qquad w(n) \sim \mathrm{WGN}(0, 1)$$
(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$.
(b) Generate 10 realizations of x(n), each with N = 256 samples. Using the LS approach with forward-backward linear
predictor, estimate the power spectrum for P = 2 and $\sigma_v^2 = 1$. Obtain an overlay plot of this estimate, and compare it with
the true spectrum.
(c) Repeat part (b), using $\sigma_v^2 = 10$. Comment on the effect of increasing noise variance on spectrum estimates.
(d) Since the noise variance $\sigma_v^2$ affects only $r_x(0)$, investigate the effect of subtracting a small amount from $r_x(0)$ on the
spectrum estimates in part (c).


8.8 Let x(n) be a random process whose correlation is estimated. The values for the first five lags are $r_x(0) = 1$, $r_x(1) = 0.7$,
$r_x(2) = 0.5$, $r_x(3) = 0.3$, and $r_x(4) = 0$.
(a) Determine and plot the Blackman-Tukey power spectrum estimate. 
(b) Assume that x(n) is modeled by an AP(2) model. Determine and plot its spectrum estimate. 
(c) Now repeat (b) assuming that AP(4) is an appropriate model for x(n) . Determine and plot the spectrum estimate. 


8.9 The narrowband process x(n) is generated using the AP(4) model

$$H(z) = \frac{1}{1 + 0.98z^{-1} + 1.92z^{-2} + 0.94z^{-3} + 0.92z^{-4}}$$

driven by WGN(0, 0.001). Generate 10 realizations, each with N = 256 samples, of this process.

(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$.

(b) Using the LS approach with forward linear predictor, estimate the power spectrum for P = 4. Obtain an overlay plot of this 
estimate, and compare it with the true spectrum. 

(c) Repeat part (b) with P = 8 and 12. Provide a qualitative description of your results with respect to model order size. 


(d) Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 4. Obtain an overlay plot 
of this estimate. Compare it with the plot in part (b). 


8.10 Consider the following PZ(4, 2) model

$$H(z) = \frac{1 - z^{-2}}{1 + 0.41z^{-4}}$$

driven by WGN(0, 1) to obtain a broadband ARMA process x(n). Generate 10 realizations, each with N = 256 samples, of
this process.
(a) Determine and plot the true power spectrum $R_x(e^{j\omega})$.
(b) Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 12. Obtain an overlay plot
of this estimate, and compare it with the true spectrum.


8.11 A random process x(n) is given by
x(n) = cos + 6,)+ @n)—- a@n—-2) + cos ty 0,) 


where w(n) ~ WGN(0, 1) and $\theta_1$ and $\theta_2$ are IID random variables uniformly distributed between 0 and 2π. Generate a
sample sequence with N = 256 samples.

(a) Determine and plot the true spectrum $R_x(e^{j\omega})$.

(b) Using the LS approach with forward-backward linear predictor, estimate the power spectrum for P = 10, 20, and 40 from the
generated sample sequence. Compare it with the true spectrum.


8.12 Show that, for large values of N, the modeling error variance estimate given by Equation (8.2.38) can be approximated by the
estimate given by Equation (8.2.39). 


8.13 This problem investigates the effect of correlation aliasing observed in LS estimation of model parameters when the AP model is
excited by discrete spectra. Consider an AP(1) model with pole at z = α excited by a periodic sequence of period N. Let x(n)
be the output sequence.

(a) Show that the correlation at lag 1 satisfies

$$r_x(1) = \frac{\alpha + \alpha^{N-1}}{1 + \alpha^{N}}\, r_x(0) \qquad \text{(P.1)}$$

(b) Using the LS approach, determine the estimate $\hat{\alpha}$ as a function of α and N. Compute $\hat{\alpha}$ for α = 0.9 and N = 10.

(c) Generate x(n), using α = 0.95 and the periodic impulse train with N = 10. Compute and plot the correlation sequence
$r_x(l)$, 0 ≤ l ≤ N − 1, of x(n). Compare your plot with the AP(1) model correlation for α = 0.95. Comment on your
observations and discuss why they explain the discrepancy between α and $\hat{\alpha}$.

(d) Repeat part (c) for N = 100 and 1000. Show analytically and numerically that $\hat{\alpha} \to \alpha$ as N → ∞.


8.14 In this problem, we investigate the equation error method of Section 8.3.1. Consider the PZ(2, 2) model
x(n) = 0.3x(n − 1) + 0.4x(n − 2) + w(n) + 0.25w(n − 2)
Generate N = 200 samples of x(n), using $w(n) \sim \mathrm{WGN}(0, \sqrt{10})$. Record values of both x(n) and w(n).
(a) Using the residual windowing method, that is, $N_i = \max(P, Q)$ and $N_f = N - 1$, compute the estimates of the above
model parameters.
(b) Compute the input variance estimate $\hat{\sigma}_w^2$ from your estimated values in part (a). Compare it with the actual value $\sigma_w^2$ and
with (8.3.12).


8.15 Consider the following PZ(4, 2) model
x(n) = 1.8766x(n − 1) − 2.6192x(n − 2) + 1.6936x(n − 3) − 0.8145x(n − 4)
+ w(n) + 0.05w(n − 1) − 0.855w(n − 2)
excited by $w(n) \sim \mathrm{WGN}(0, \sqrt{10})$. Generate 300 samples of x(n).
(a) Assuming the AP(10) model for the data segment, estimate its parameters by using the LS approach described in Section 8.2.
(b) Generate a plot similar to Figure 8.13 by computing spectra corresponding to the true PZ(4, 2), estimated PZ(4, 2), and
estimated AP(10) models. Compare and comment on your results.

8.16 Using matrix notation, show that AZ power spectrum estimation is equivalent to the Blackman-Tukey method discussed in
Chapter 4.


8.17 Consider the PZ(4, 2) model given in Problem 8.15. Generate 300 samples of x(n).

(a) Fit an AP(5) model to the data and plot the resulting spectrum. 

(b) Fit an AP(10) model to the data and plot the resulting spectrum. 

(c) Fit an AP(50) model to the data and plot the resulting spectrum. 

(d) Compare your plots with the true spectrum, and discuss the effect of model mismatch on the quality of the spectrum. 


8.18 Use the supplied (about 50-ms) segment of a speech signal sampled at 8192 samples per second.

(a) Compute a periodogram of the speech signal (see Chapter 4). 

(b) Using data windowing, fit an AP(16) model to the speech data and compute the spectrum. 

(c) Using the residual windowing, fit a PZ(12, 6) model to the speech data and compute the spectrum. 
(d) Plot the above three spectra on one graph, and comment on the performance of each method. 


8.19 One practical approach to spectrum estimation discussed in Section 8.4 is the prewhitening and postcoloring method.

(a) Develop a MATLAB function to implement this method. Use the forward/backward LS method to determine AP(P) parameters 
and the Welch method for nonparametric spectrum estimation. 

(b) Verify your function on the short segment of the speech segment from Problem 8.18. 

(c) Compare your results with those obtained in Problem 8.18. 


8.20 Consider a white noise process with variance $\sigma_w^2$. Find its minimum-variance power spectral estimate.


8.21 Find the minimum-variance spectrum of a first-order all-pole model, that is,
$$x(n) = -a_1 x(n-1) + w(n)$$


8.22 The filter coefficient vector for the minimum-variance spectrum estimator is given in (8.5.10). Using Lagrange multipliers,
discussed in Appendix B, solve this constrained optimization to find this weight vector. 


8.23 Using the relationship between the minimum-variance and the all-pole model spectrum estimators in (8.5.22), generate a recursive
relationship for the minimum-variance spectrum estimators of increasing window length. In other words, write $\hat{R}_M^{(mv)}(e^{j2\pi f})$ in
terms of $\hat{R}_{M-1}^{(mv)}(e^{j2\pi f})$ and the all-pole model spectrum estimator $\hat{R}_M^{(ap)}(e^{j2\pi f})$ in (8.5.20).


8.24 In Pisarenko harmonic decomposition, discussed in Section 8.5.2, we determine the frequencies of the complex exponentials in
white noise through the use of the pseudospectrum. The word pseudospectrum was used because its value does not correspond to
an estimated power. Find a set of linear equations that can be solved to find the powers of the complex exponentials. Hint: Use the
relationship of eigenvalues and eigenvectors $\mathbf{R}_x \mathbf{q}_m = \lambda_m \mathbf{q}_m$ for m = 1, 2, …, M.


8.25 For the MUSIC algorithm, we showed a means of using the MUSIC pseudospectrum to derive a polynomial that could be rooted to
obtain frequency estimates, which is known as root-MUSIC. Find a similar rooting method for the minimum-norm frequency 
estimation procedure. 


8.26 The Pisarenko harmonic decomposition, MUSIC, and minimum-norm algorithms yield frequency estimates by computing a
pseudospectrum using the Fourier transforms of the eigenvectors. However, these pseudospectra do not actually estimate a power. 
Derive the minimum-variance spectral estimator in terms of the Fourier transforms of the eigenvectors and the associated 
eigenvalues. Relate this result to the MUSIC and eigenvector method pseudospectra. 


8.27 Show that the pseudospectrum for the MUSIC algorithm is equivalent to the minimum-variance spectrum in the case of an infinite
signal-to-noise ratio. 


8.28 Find a relationship between the minimum-norm pseudospectrum and the MUSIC pseudospectrum. What are the implications of
this relationship?

8.29 In (8.5.22), we derived a relationship between the minimum-variance spectral estimator and spectrum estimators derived from
all-pole models of orders 1 to M . Find a similar relationship between the pseudospectra of the MUSIC and minimum-norm 
algorithms that shows that the MUSIC pseudospectrum is a weighted average of minimum-norm pseudospectra. 

CHAPTER 9


Adaptive Filters 


In Chapter 1, we discussed different practical applications that demonstrated the need for adaptive filters, pointed out 
the key aspects of the underlying signal operating environment (SOE), and illustrated the key features and types of 
adaptive filters. The defining characteristic of an adaptive filter is its ability to operate satisfactorily, according to a 
criterion of performance acceptable to the user, in an unknown and possibly time-varying environment without the 
intervention of the designer. In Chapter 5, we developed the theory of optimum filters under the assumption that the 
filter designer has complete knowledge of the statistical properties (usually second-order moments) of the SOE. 
However, in real-world applications such information is seldom available, and the most practical solution is to use an 
adaptive filter. Adaptive filters can improve their performance, during normal operation, by learning the statistical 
characteristics through processing current signal observations. 

In this chapter, we develop a mathematical framework for the design and performance evaluation of adaptive 
filters, both theoretically and by simulation. The goal of an adaptive filter is to “find and track” the optimum filter 
corresponding to the same signal operating environment with complete knowledge of the required statistics. In this 
context, optimum filters provide both guidance for the development of adaptive algorithms and a yardstick for 
evaluating the theoretical performance of adaptive filters. We start in Section 9.1 with a discussion of a few typical 
application problems that can be effectively solved by using an adaptive filter. The performance of adaptive filters is 
evaluated using the concepts of stability, speed of adaptation, quality of adaptation, and tracking capabilities. These 
issues and the key features of an adaptive filter are discussed in Section 9.2. Since most adaptive algorithms originate 
from deterministic optimization methods, in Section 9.3 we introduce the family of steepest-descent algorithms and 
study their properties. Sections 9.4 and 9.5 provide a detailed discussion of the derivation, properties, and applications 
of the two most important adaptive filtering algorithms: the least mean square (LMS) and the recursive least-squares 
(RLS) algorithms. Section 9.6 provides fast implementations of the RLS algorithm for the FIR filtering case. The 
development of the latter algorithms is a result of the shift invariance of the data stored in the memory of the FIR filter. 
Finally, in Section 9.7 we provide a concise introduction to the tracking properties of the LMS and the RLS 
algorithms. 


9.1 Typical Applications of Adaptive Filters 


As we have already seen in Chapter 1, many practical applications cannot be successfully solved by using fixed 
digital filters because either we do not have sufficient information to design a digital filter with fixed coefficients or 
the design criteria change during the normal operation of the filter. Most of these applications can be successfully 
solved by using a special type of “smart” filters known collectively as adaptive filters. The distinguishing feature of 
adaptive filters is that they can modify their response to improve their performance during operation without any 
intervention from the user. 

The best way to introduce adaptive filters is with some applications for which they are well suited. These and 
other applications are discussed in greater detail in the sequel as we develop the necessary background and tools. 


9.1.1 Echo Cancelation in Communications 


An echo is the delayed and distorted version of an original signal that returns to its source. In some applications (radar, 
sonar, or ultrasound), the echo is the wanted signal; however, in communication applications, the echo is an unwanted 
signal that must be eliminated. There are two types of echoes in communication systems: (1) electrical or line echoes, 
which are generated electrically due to impedance mismatches at points along the transmission medium, and (2) 
acoustic echoes, which result from the reflection of sound waves and acoustic coupling between a microphone and a 
loudspeaker. 
Here we focus on electrical echoes in voice communications; electrical echoes in data communications are 
discussed in Section 9.4.4, and acoustic echoes in teleconferencing and hands-free telephony were discussed in 
Section 1.4.1. 


FIGURE 9.1 
Echo generation in a long-distance telephone network. (Diagram labels: four-wire connection, two-wire connection, echo of A's speech.) 


Electrical echoes are observed on long-distance telephone circuits. A simplified form of such a circuit, which is 
sufficient for the present discussion, is shown in Figure 9.1. The local links from the customer to the telephone office 
consist of bidirectional two-wire connections, whereas the connection between the telephone offices is a four-wire 
carrier facility that may include a satellite link. The conversion between two-wire and four-wire links is done by 
special devices known as hybrids. An ideal hybrid should pass (1) the incoming signal to the two-wire output without 
any leakage into its output port and (2) the signal from the two-wire circuit to its output port without reflecting any 
energy back to the two-wire line (Sondhi and Berkley 1980). In practice, due to impedance mismatches, the hybrids 
do not operate perfectly. As a result, some energy on the incoming branch of the four-wire circuit leaks into the 
outgoing branch and returns to the source as an echo (see Figure 9.1). This echo, which is usually 11 dB down from 
the original signal, makes it difficult to carry on a conversation if the round-trip delay is larger than 40 ms. Satellite 
links, as a consequence of high altitude, involve round-trip delays of 500 to 600 ms. 

The first devices used by telephone companies to control voice echoes were echo suppressors. Basically, an echo 
suppressor is a voice-activated switch that attempts to impose an open circuit on the return path from listener to talker 
when the listener is silent (see Figure 9.2). The main problems with these devices are speech clipping during 
double-talking and the inability to effectively deal with round-trip delays longer than 100 ms (Weinstein 1977). 


FIGURE 9.2 
Principle of echo suppression. (Diagram labels: echo suppressor, control, hybrid, speech from B.) 


The problems associated with echo suppressors could be largely avoided if we could estimate the transmission 
path from point C to point D (see Figure 9.3), which is known as the echo path. If we knew the echo path, we could 
design a filter that produced a copy or replica of the echo signal when driven by the signal at point C. Subtraction of 
the echo replica from the signal at point D will eliminate the echo without distorting the speech of the second talker 
that may be present at point D. The resulting device, shown in Figure 9.3, is known as an echo canceler. 

FIGURE 9.3 
Principle of echo cancelation. 


In practice, the channel characteristics are generally not known. For dial-up telephone lines, the channel differs 
from call to call, and the characteristics of radio and microwave channels (phase perturbations, fading, etc.) change 
significantly with time. Therefore, we cannot design and use a fixed echo canceler with satisfactory performance for 
all possible connections. There are two possible ways around this problem: 

1. Design a compromise fixed echo canceler based on some “average” echo path, assuming that we have sufficient 
information about the connections to be seen by the canceler. 

2. Design an adaptive echo canceler that can “learn” the echo path when it is first turned on and afterward “tracks” its 
variations without any intervention from the designer. Since an adaptive canceler matches the echo path for any 
given connection, it performs better than a compromise fixed canceler. 


We stress that the main task of the canceler is to estimate the echo signal with sufficient accuracy; the estimation 
of the echo path is simply the means of achieving this goal. The performance of the canceler is measured by the 
attenuation, in decibels, of the echo, which is known as echo return loss enhancement. The adaptive echo canceler 
achieves this goal by modifying its response, using the residual echo signal in an as yet unspecified way. 
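
As a rough numerical illustration of echo-replica subtraction and of the echo return loss enhancement, the following sketch models the echo path as a short FIR filter and subtracts a replica produced by a slightly mismatched estimate of that path. The echo-path taps, the 10% tap error, the white-noise stand-in for speech, and the helper name `erle_db` are all illustrative assumptions, not values taken from the text; the second talker is assumed silent, so the signal at point D is pure echo.

```python
import numpy as np

def erle_db(echo, residual):
    """Echo return loss enhancement: echo power over residual-echo power, in dB."""
    return 10.0 * np.log10(np.mean(echo**2) / np.mean(residual**2))

rng = np.random.default_rng(0)
far_end = rng.standard_normal(5000)             # signal at point C (noise as a stand-in for speech)

# Hypothetical echo path from C to D, modeled as a short FIR filter.
echo_path = np.array([0.0, 0.25, -0.12, 0.06])
echo = np.convolve(far_end, echo_path)[:far_end.size]      # echo arriving at point D

# Canceler whose taps are all 10% too small: the replica removes most of the echo.
estimated_path = 0.9 * echo_path
replica = np.convolve(far_end, estimated_path)[:far_end.size]
residual = echo - replica

print(f"ERLE = {erle_db(echo, residual):.1f} dB")
```

Because the residual here is one-tenth of the echo in amplitude, the ERLE comes out at 20 dB; a more accurate estimate of the echo path would raise it further.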

Adaptive echo cancelers are widely used in voice telecommunications, and the international standards 
organization CCITT has issued a set of recommendations (CCITT G.165) that outlines the basic requirements for 
echo cancelers. More details can be found in Weinstein (1977) and Murano et al. (1990). 


9.1.2 Linear Predictive Coding 


The efficient storage and transmission of analog signals using digital systems requires the minimization of the 
number of bits necessary to represent the signal while maintaining the quality to an acceptable level according to a 
certain criterion of performance. The conversion of an analog (continuous-time, continuous-amplitude) signal to a 
digital (discrete-time, discrete-amplitude) signal involves two processes: sampling and quantization. Sampling 
converts a continuous-time signal to a discrete-time signal by measuring its amplitude at equidistant intervals of time. 
Quantization involves the representation of the measured continuous amplitude by using a finite number of symbols. 
Therefore, a small range of amplitudes will use the same symbol (see Figure 9.4). A code word is assigned to each 
symbol by the coder. When the digital representation is used for digital signal processing, the quantization levels and 
the corresponding code words are uniformly distributed. However, for coding applications, levels may be 
nonuniformly distributed to match the distribution of the signal amplitudes. 

FIGURE 9.4 
Partitioning of the range of a 3-bit (eight-level) uniform quantizer. 


For all practical purposes, the range of a quantizer is equal to $R_Q = \Delta \cdot 2^B$, where $\Delta$ is the quantization step size 
and $B$ is the number of bits, and should cover the dynamic range of the signal. The difference between the 
unquantized sample $x(n)$ and the quantized sample $\tilde{x}(n)$, that is, 

$$e(n) = \tilde{x}(n) - x(n) \tag{9.1.1}$$

is known as the quantization error and is always in the range $-\Delta/2 \le e(n) \le \Delta/2$. If we define the signal-to-noise 
ratio by 

$$\text{SNR} \triangleq \frac{E\{x^2(n)\}}{E\{e^2(n)\}} \tag{9.1.2}$$

it can be shown (Rabiner and Schafer 1978; Jayant and Noll 1984) that 

$$\text{SNR(dB)} \simeq 6B \tag{9.1.3}$$

which states that each added binary digit increases the SNR by 6 dB. 
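
The 6-dB-per-bit rule can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the test signal is uniformly distributed over the full quantizer range (so there is no overload), the quantizer is a midrise uniform quantizer, and the function names `quantize_uniform` and `snr_db` are hypothetical helpers introduced here.

```python
import numpy as np

def quantize_uniform(x, num_bits, full_range):
    """Midrise uniform quantizer with 2**num_bits levels covering [-full_range/2, full_range/2)."""
    step = full_range / 2**num_bits                      # step size = range / 2^B
    index = np.floor(x / step)
    index = np.clip(index, -2**(num_bits - 1), 2**(num_bits - 1) - 1)
    return (index + 0.5) * step                          # reconstruct at the cell midpoint

def snr_db(x, x_q):
    """Ratio of signal power to quantization-error power, in dB."""
    return 10.0 * np.log10(np.mean(x**2) / np.mean((x_q - x)**2))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 100_000)          # test signal that fills a quantizer range of 2

snr = {b: snr_db(x, quantize_uniform(x, b, full_range=2.0)) for b in range(4, 11)}
for b in range(5, 11):
    print(f"B = {b:2d}: SNR = {snr[b]:5.1f} dB, gain over B-1 bits = {snr[b] - snr[b - 1]:.2f} dB")
```

For signals that do not fill the quantizer range, the SNR at a given number of bits is lower, but the increment per added bit stays close to 6 dB.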

For a fixed number of bits, decreasing the dynamic range of the signal (and therefore the range of the quantizer) 
decreases the required quantization step and therefore the average quantization error power. Therefore, we can 
increase the SNR by reducing the dynamic range, or equivalently the variance of the signal. If the signal samples are 
significantly correlated, the variance of the difference between adjacent samples is smaller than the variance of the 
original signal. Thus, we can improve the SNR by quantizing this difference instead of the original signal. 
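
To make the last claim concrete, consider (as an illustrative assumption) a zero-mean, real-valued stationary signal with variance $\sigma_x^2$ and normalized correlation $\rho(1)$ between adjacent samples, and the simplest difference signal $d(n) = x(n) - x(n-1)$:

```latex
\begin{aligned}
E\{d^2(n)\} &= E\{x^2(n)\} - 2E\{x(n)x(n-1)\} + E\{x^2(n-1)\} \\
            &= 2\sigma_x^2\,[1 - \rho(1)]
\end{aligned}
```

The difference therefore has a smaller variance than the signal whenever $\rho(1) > 1/2$. For example, $\rho(1) = 0.9$ gives $E\{d^2(n)\} = 0.2\,\sigma_x^2$, a reduction of about 7 dB.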

The differential quantization concept is exploited by the linear predictive coding (LPC) system illustrated in 
Figure 9.5. The quantized signal is the difference 

$$d(n) = x(n) - \hat{x}(n) \tag{9.1.4}$$

where $\hat{x}(n)$ is an estimate or prediction of the signal $x(n)$ obtained by the predictor using a quantized version 

$$\tilde{x}(n) = \hat{x}(n) + \tilde{d}(n) \tag{9.1.5}$$

of the original signal (see Figure 9.5). If the quantization error of the difference signal is 

$$e_d(n) = \tilde{d}(n) - d(n) \tag{9.1.6}$$

we obtain 

$$\tilde{x}(n) = x(n) + e_d(n) \tag{9.1.7}$$

using (9.1.4) and (9.1.5). The significance of (9.1.7) is that the quantization error of the original signal is equal to the 
quantization error of the difference signal, independently of the properties of the predictor. Note that if $c'(n) = c(n)$, 
that is, there are no transmission or storage errors, then the signal reconstructed by the decoder is $\tilde{x}'(n) = \tilde{x}(n)$. If the 
prediction is good, the dynamic range of $d(n)$ should be smaller than the dynamic range of $x(n)$, resulting in a smaller 
quantization noise for the same number of bits or the same quantization noise with a smaller number of bits. The 
performance of the LPC system depends on the accuracy of the predictor. In most practical applications, we use a 
linear predictor that forms an estimate (prediction) $\hat{x}(n)$ of the present sample $x(n)$ as a linear combination of the $M$ 
past samples, that is, 

$$\hat{x}(n) = \sum_{k=1}^{M} a_k\, \tilde{x}(n-k) \tag{9.1.8}$$

The coefficients $\{a_k\}_1^M$ of the linear predictor are determined by exploiting the correlation between adjacent 
samples of the input signal with the objective of making the prediction error as small as possible. Since the statistical 
properties of the signal $x(n)$ are unknown and change with time, we cannot design an optimum fixed predictor. 
The established practical solution uses an adaptive linear predictor that automatically adjusts its coefficients to 
compute a “good” prediction at each time instant. A detailed discussion of adaptive linear prediction and its 
application to audio, speech, and video signal coding is provided in Jayant and Noll (1984). 

FIGURE 9.5 
Block diagram of a linear predictive coding system: (a) coder and (b) decoder. 
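
The coder loop described above can be sketched in a few lines. This is a minimal illustration, not the system of Figure 9.5 in full: the predictor is fixed rather than adaptive, the quantizer is a simple rounding quantizer without overload, and the test signal, the tap value 0.95, the step size 0.1, and the names `quantize` and `dpcm_encode` are assumptions made for the example. It checks that the error in the reconstructed signal equals the quantization error of the difference signal, whatever the predictor.

```python
import numpy as np

def quantize(value, step):
    """Uniform rounding quantizer with step size `step` (no overload, for simplicity)."""
    return step * np.round(value / step)

def dpcm_encode(x, a, step):
    """Return the difference signal, its quantized version, and the reconstructed signal."""
    M = len(a)
    recon = np.zeros(len(x) + M)              # reconstructed samples, M leading zeros as initial state
    d_exact = np.zeros(len(x))
    d_quant = np.zeros(len(x))
    for n in range(len(x)):
        past = recon[n:n + M][::-1]           # reconstructed samples at times n-1, ..., n-M
        prediction = np.dot(a, past)          # linear prediction from past reconstructed samples
        d_exact[n] = x[n] - prediction        # difference signal
        d_quant[n] = quantize(d_exact[n], step)
        recon[n + M] = prediction + d_quant[n]
    return d_exact, d_quant, recon[M:]

rng = np.random.default_rng(2)
# Strongly correlated test signal: a first-order autoregressive process (illustrative choice).
w = rng.standard_normal(2000)
x = np.zeros_like(w)
for n in range(1, len(w)):
    x[n] = 0.95 * x[n - 1] + w[n]

a = np.array([0.95])                          # fixed one-tap predictor, assumed known here
d, d_q, x_rec = dpcm_encode(x, a, step=0.1)

# Reconstruction error minus difference-signal quantization error: zero up to rounding.
print(np.max(np.abs((x_rec - x) - (d_q - d))))
print(f"var(x) = {np.var(x):.2f}, var(d) = {np.var(d):.2f}")
```

The second printed line shows the reduction in variance, and hence in required quantizer range, obtained by coding the difference instead of the signal itself.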


9.1.3 Noise Cancelation 


In Section 1.4.1 we discussed the concept of active noise control using adaptive filters. We now provide a theoretical 
explanation for the general problem of noise canceling using multiple sensors. The principle of general noise 
cancelation is illustrated in Figure 9.6. The signal of interest $s(n)$ is corrupted by uncorrelated additive noise $v_1(n)$, 
and the combined signal $s(n) + v_1(n)$ provides what is known as the primary input. A second sensor, located at a 
different point, acquires a noise $v_2(n)$ (reference input) that is uncorrelated with the signal $s(n)$ but correlated 
with the noise $v_1(n)$. If we can design a filter that provides a good estimate $\hat{y}(n)$ of the noise $v_1(n)$, by exploiting 
the correlation between $v_1(n)$ and $v_2(n)$, then we could recover the desired signal by subtracting $\hat{y}(n) \simeq v_1(n)$ 
from the primary input. 

FIGURE 9.6 
Principle of adaptive noise cancellation using a reference input. 


Let us assume that the signals $s(n)$, $v_1(n)$, and $v_2(n)$ are jointly wide-sense stationary with zero mean values. 
The “clean” signal is given by the error 

$$e(n) = s(n) + [v_1(n) - \hat{y}(n)]$$

where $\hat{y}(n)$ depends on the filter structure and parameters. The MSE is given by 

$$E\{|e(n)|^2\} = E\{|s(n)|^2\} + E\{|v_1(n) - \hat{y}(n)|^2\}$$

because the signals $s(n)$ and $v_1(n) - \hat{y}(n)$ are uncorrelated. Since the signal power is not influenced by the filter, 
if we design a filter that minimizes the total output power $E\{|e(n)|^2\}$, then that filter will minimize the output noise 
power $E\{|v_1(n) - \hat{y}(n)|^2\}$. Therefore, $\hat{y}(n)$ will be the MMSE estimate of the noise $v_1(n)$, and the canceler 
maximizes the output signal-to-noise ratio. If we know the second-order moments of the primary and reference inputs, 
we can design an optimum linear canceler using the techniques discussed in Chapter 5. However, in practice, the 
design of an optimum canceler is not feasible because the required statistical moments are either unknown or 
time-varying. Once again, a successful solution can be obtained by using an adaptive filter that automatically adjusts 
its parameters to obtain the best possible estimate of the interfering noise (Widrow et al. 1975). 
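
The power decomposition used in this argument can be checked with a small simulation. In the sketch below, the sinusoidal signal, the three-tap coupling between the reference noise and the primary noise, the candidate canceler filters, and the helper name `canceler_output` are all illustrative assumptions. For each fixed filter, the measured output power is compared with the sum of the signal power and the residual noise power.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50_000
s = np.sin(2 * np.pi * 0.01 * np.arange(N))      # signal of interest
v2 = rng.standard_normal(N)                       # reference noise
noise_path = np.array([0.8, -0.4, 0.2])           # hypothetical coupling from reference to primary noise
v1 = np.convolve(v2, noise_path)[:N]              # noise in the primary input
primary = s + v1

def canceler_output(c):
    """Subtract the FIR-filtered reference from the primary input."""
    y_hat = np.convolve(v2, c)[:N]
    return primary - y_hat, v1 - y_hat            # canceler output and residual noise

results = []
for label, c in [("no filtering", np.zeros(3)),
                 ("rough guess", np.array([0.6, -0.2, 0.0])),
                 ("exact match", noise_path)]:
    e, resid = canceler_output(c)
    output_power = np.mean(e**2)
    predicted = np.mean(s**2) + np.mean(resid**2)
    results.append((output_power, predicted))
    print(f"{label:13s} output power = {output_power:.3f}, signal + residual noise = {predicted:.3f}")
```

The two columns agree up to a small sample cross term, and the filter with the lowest output power is also the one with the lowest residual noise, which is why minimizing the total output power is a usable criterion even though the noise itself is not observable.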


9.2 Principles of Adaptive Filters 


In this section, we discuss a mathematical framework for the analysis and performance evaluation of adaptive 
algorithms. The goal is to develop design guidelines for the application of adaptive algorithms to practical problems. 
The need for adaptive filters and representative applications that can benefit from their use have been discussed in 
Sections 1.4.1 and 9.1. 


9.2.1 Features of Adaptive Filters 


The applications we have discussed are only a sample from a multitude of practical problems that can be successfully 
solved by using adaptive filters, that is, filters that automatically change their characteristics to attain the right 
response at the right time. Every adaptive filtering application involves one or more input signals and a desired 
response signal that may or may not be accessible to the adaptive filter. We collectively refer to these signals as the 
signal operating environment (SOE) of the adaptive filter. Every adaptive filter consists of three modules (see Figure 
9.7): 

FIGURE 9.7 
Basic elements of a general adaptive filter. 


Filtering structure. This module forms the output of the filter using measurements of the input signal or signals. 
The filtering structure is linear if the output is obtained as a linear combination of the input measurements; 
otherwise it is said to be nonlinear. For example, the filtering module can be an adjustable finite impulse 
response (FIR) digital filter implemented with a direct or lattice structure or a recursive filter implemented 
using a cascade structure. The structure is fixed by the designer, and its parameters are adjusted by the 
adaptive algorithm. 

Criterion of performance (COP). The output of the adaptive filter and the desired response (when available) are 
processed by the COP module to assess its quality with respect to the requirements of the particular 
application. The choice of the criterion is a balanced compromise between what is acceptable to the user of 
the application and what is mathematically tractable; that is, it can be manipulated to derive an adaptive 
algorithm. Most adaptive filters use some average form of the square error because it is mathematically 
tractable and leads to the design of useful practical systems. 

Adaptation algorithm. The adaptive algorithm uses the value of the criterion of performance, or some function 
of it, and the measurements of the input and desired response (when available) to decide how to modify the 
parameters of the filter to improve its performance. The complexity and the characteristics of the adaptive 
algorithm are functions of the filtering structure and the criterion of performance. 


The design of any adaptive filter requires some generic a priori information about the SOE and a deep 
understanding of the particular application. This information is needed by the designer to choose the criterion of 
performance and the filtering structure. Clearly, unreliable a priori information and/or incorrect assumptions about the 
SOE can lead to serious performance degradations or even unsuccessful adaptive filter applications. The conversion 
of the performance assessment to a successful parameter adjustment strategy, that is, the design of an adaptive 
algorithm, is the most difficult step in the design and application of adaptive filters. 

If the characteristics of the SOE are constant, the goal of the adaptive filter is to find the parameters that give the 
best performance and then stop the adjustment. The initial period, from the time the filter starts its operation until the 
time it gets reasonably close to its best performance, is known as the acquisition or convergence mode. However, 
when the characteristics of the SOE change with time, the adaptive filter should first find and then continuously 
readjust its parameters to track these changes. In this case, the filter starts with an acquisition phase that is followed 
by a tracking mode. 

A very influential factor in the design of adaptive algorithms is the availability of a desired response signal. We 
have seen that for certain applications, the desired response may not be available for use by the adaptive filter. 
Therefore, the adaptation must be performed in one of two ways: 

FIGURE 9.8 
Basic elements of a supervised adaptive filter. 


Supervised adaptation. At each time instant, the adaptive filter knows in advance the desired response, 
computes the error (i.e., the difference between the desired and actual response), evaluates the criterion of 
performance, and uses it to adjust its coefficients. In this case, the structure in Figure 9.7 is simplified to that 
of Figure 9.8. 

Unsupervised adaptation. When the desired response is unavailable, the adaptive filter cannot explicitly form 
and use the error to improve its behavior. In some applications, the input signal has some measurable 
property (e.g., constant envelope) that is lost by the time it reaches the adaptive filter. The adaptive filter 
adjusts its parameters in such a way as to restore the lost property of the input signal. The property restoral 
approach to adaptive filtering was introduced in Treichler et al. (1987). In some other applications (e.g., 
digital communications) the basic task of the adaptive filter is to classify each received pulse to one of a finite 
set of symbols. In this case we basically have a problem of unsupervised classification (Fukunaga 1990). 


In this chapter we focus our discussion on supervised adaptive filters, that is, filters that have access to a desired 
response signal. 


9.2.2 Optimum versus Adaptive Filters 


We have mentioned several times that the theory of stochastic processes provides the mathematical framework for the 
design and analysis of optimum filters. In Chapter 5, we introduced filters that are optimum according to the MSE 
criterion of performance; and in Chapter 6, we developed algorithms and structures for their efficient design and 
implementation. However, optimum filters are a theoretical tool and cannot be used in practical applications because 
we do not know the statistical quantities (e.g., second-order moments) that are required for their design. Adaptive 
filters can be thought of as the practical counterpart of optimum filters: They try to reach the performance of optimum 
filters by processing measurements of the SOE in real time, which makes up for the lack of a priori statistics. 

For this analysis, we consider the general case of a linear combiner that includes filtering and prediction as 
special cases. However, for convenience we use the terms filters and filtering. We remind the reader that, from a 
mathematical point of view, the key difference between a linear combiner and an FIR filter or predictor is the shift 
invariance (temporal ordering) of the input data vector. This difference, which is illustrated in Figure 9.9, also has 
important implications in the implementation of adaptive filters. To this end, suppose that the SOE is comprised of $M$ 
input signals $x_k(n,\zeta)$ and a desired response signal $y(n,\zeta)$, which are sample realizations of random sequences. 
(For clarity, in this section only, we include the dependence on $\zeta$ to denote random variables.) 
Then the estimate of $y(n,\zeta)$ is computed by using the linear combiner 

$$\hat{y}(n,\zeta) = \sum_{k=1}^{M} c_k^*(n)\, x_k(n,\zeta) \triangleq \mathbf{c}^H(n)\,\mathbf{x}(n,\zeta) \tag{9.2.1}$$

where 

$$\mathbf{c}(n) = [c_1(n) \;\; c_2(n) \;\; \cdots \;\; c_M(n)]^T \tag{9.2.2}$$

is the coefficient vector and 

$$\mathbf{x}(n,\zeta) = [x_1(n,\zeta) \;\; x_2(n,\zeta) \;\; \cdots \;\; x_M(n,\zeta)]^T \tag{9.2.3}$$

is the input data vector. For single-sensor applications, the input data vector is shift-invariant 

$$\mathbf{x}(n,\zeta) = [x(n,\zeta) \;\; x(n-1,\zeta) \;\; \cdots \;\; x(n-M+1,\zeta)]^T \tag{9.2.4}$$

and the linear combiner takes the form of the FIR filter 

$$\hat{y}(n,\zeta) = \sum_{k=0}^{M-1} h(n,k)\, x(n-k,\zeta) \triangleq \mathbf{c}^H(n)\,\mathbf{x}(n,\zeta) \tag{9.2.5}$$

where $h(n,k) = c_{k+1}^*(n)$, $0 \le k \le M-1$, are the samples of the impulse response at time $n$. 

FIGURE 9.9 
Illustration of the difference of the input signal between (a) a multiple-input linear combiner and (b) a single-input FIR filter. 
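
The equivalence between the inner-product form and the FIR (convolution) form, and the shift invariance of the single-sensor data vector, can be illustrated as follows. The numerical values, the time-invariant impulse response, the zero initial conditions, and the helper name `data_vector` are assumptions made for this sketch.

```python
import numpy as np

def data_vector(x, n, M):
    """Shift-invariant input vector [x(n), x(n-1), ..., x(n-M+1)], with x(k) = 0 for k < 0."""
    return np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(M)], dtype=complex)

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0], dtype=complex)
h = np.array([0.5, 0.25 + 0.5j, -0.1j])      # impulse response, held fixed in time here
c = np.conj(h)                                # coefficient vector: each h(k) is the conjugate of a coefficient

# Combiner form: inner product c^H x(n); np.vdot conjugates its first argument.
y_combiner = np.array([np.vdot(c, data_vector(x, n, len(h))) for n in range(len(x))])

# FIR form: ordinary convolution of the input with the impulse response.
y_fir = np.convolve(x, h)[:len(x)]

print(np.allclose(y_combiner, y_fir))

# Shift invariance: all but the newest entry of x(n) are already present in x(n-1).
print(data_vector(x, 3, 3)[1:], data_vector(x, 2, 3)[:-1])
```

In a general multiple-input combiner the entries of the data vector come from different sensors and no such overlap between successive vectors exists, which is the property that fast FIR algorithms exploit.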


Optimum filters. If we know the second-order moments of the SOE, we can design an optimum filter $\mathbf{c}_o(n)$ by 
solving the normal equations 

$$\mathbf{R}(n)\,\mathbf{c}_o(n) = \mathbf{d}(n) \tag{9.2.6}$$

where 

$$\mathbf{R}(n) = E\{\mathbf{x}(n,\zeta)\,\mathbf{x}^H(n,\zeta)\} \tag{9.2.7}$$

and 

$$\mathbf{d}(n) = E\{\mathbf{x}(n,\zeta)\, y^*(n,\zeta)\} \tag{9.2.8}$$

are the correlation matrix of the input data vector and the cross-correlation between the input data vector and the 
desired response, respectively. During its normal operation, the optimum filter works with specific realizations of the 
SOE, that is, 

$$\hat{y}_o(n,\zeta) = \mathbf{c}_o^H(n)\,\mathbf{x}(n,\zeta) \tag{9.2.9}$$

$$e_o(n,\zeta) = y(n,\zeta) - \hat{y}_o(n,\zeta) \tag{9.2.10}$$

where $\hat{y}_o(n,\zeta)$ is the optimum estimate and $e_o(n,\zeta)$ is the optimum instantaneous error [see 
Figure 9.10(a)]. However, the filter is optimized with respect to its average performance across all possible 
realizations of the SOE, and the MMSE 

$$P_o(n) = E\{|e_o(n,\zeta)|^2\} = P_y(n) - \mathbf{d}^H(n)\,\mathbf{c}_o(n) \tag{9.2.11}$$

shows how well the filter performs on average. Also, we emphasize that the optimum coefficient vector is a 
nonrandom quantity and that the desired response is not essential for the operation of the optimum filter [see Equation 
(9.2.9)]. 

FIGURE 9.10 
Illustration of the difference in operation between (a) optimum filters and (b) adaptive filters. 


If the SOE is stationary, the optimum filter is computed once and is used with all realizations 
$\{\mathbf{x}(n,\zeta), y(n,\zeta)\}$. For nonstationary environments, the optimum filter design is repeated at every time instant $n$ 
because the optimum filter is time-varying. 
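
For a stationary SOE with known moments, the design reduces to one linear solve. The sketch below uses assumed moments for two real-valued inputs; they are consistent, for instance, with a desired response of the form $y(n) = 0.5\,x(n) + v(n)$, a unit-power input with lag-one correlation 0.5, and noise power 0.75, but any valid moments could be substituted. Here $P_y$ denotes the power $E\{|y|^2\}$ of the desired response.

```python
import numpy as np

# Assumed second-order moments of a stationary SOE with M = 2 real-valued inputs.
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])        # correlation matrix of the input data vector
d = np.array([0.5, 0.25])         # cross-correlation between input vector and desired response
P_y = 1.0                         # power of the desired response

c_o = np.linalg.solve(R, d)       # normal equations: R c_o = d
P_o = P_y - np.vdot(d, c_o).real  # MMSE: P_y - d^H c_o

print(c_o, P_o)

# Operation on one realization needs only the input vector, not the desired response.
x_n = np.array([0.3, -1.2])       # hypothetical observed data vector
y_hat = np.vdot(c_o, x_n).real
```

With these moments the solution is approximately $[0.5, 0]$ with an MMSE of 0.75; the coefficients are computed once, offline, and are not random.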


Adaptive filters. In most practical applications, where the second-order moments $\mathbf{R}(n)$ and $\mathbf{d}(n)$ are 
unknown, the use of an adaptive filter is the best solution. If the SOE is ergodic, we have 

$$\mathbf{R} = \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \mathbf{x}(n,\zeta)\,\mathbf{x}^H(n,\zeta) \tag{9.2.12}$$

$$\mathbf{d} = \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} \mathbf{x}(n,\zeta)\, y^*(n,\zeta) \tag{9.2.13}$$

because ensemble averages are equal to time averages (see Section 2.1). If we collect a sufficient amount of data 
$\{\mathbf{x}(n,\zeta), y(n,\zeta)\}_0^{N-1}$, we can obtain an acceptable estimate of the optimum filter by computing the estimates 

$$\hat{\mathbf{R}}_N(\zeta) = \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{x}(n,\zeta)\,\mathbf{x}^H(n,\zeta) \tag{9.2.14}$$

$$\hat{\mathbf{d}}_N(\zeta) = \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{x}(n,\zeta)\, y^*(n,\zeta) \tag{9.2.15}$$

by time-averaging and then solving the linear system 

$$\hat{\mathbf{R}}_N(\zeta)\,\mathbf{c}_N(\zeta) = \hat{\mathbf{d}}_N(\zeta) \tag{9.2.16}$$

The obtained coefficients can be used to filter the data in the interval $0 \le n \le N-1$ or to start filtering the data for 
$n \ge N$, on a sample-by-sample basis, in real time. This procedure, which we called block adaptive filtering in 
Chapter 7, should be repeated each time the properties of the SOE change significantly. Clearly, block adaptive filters 
cannot track statistical variations within the operating block and cannot be used in all applications. 
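
The block procedure (time-average the moments, then solve the resulting linear system) can be sketched as follows. The data model is an assumption for the example: real-valued white input, a hypothetical three-tap system `c_true` generating the desired response, and a small amount of measurement noise.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 2000, 3
c_true = np.array([1.0, -0.5, 0.25])                 # hypothetical system generating y(n)

x = rng.standard_normal(N + M - 1)                   # single-sensor input, white for simplicity
# Row n holds one shift-invariant data vector: newest sample first, oldest last.
X = np.array([x[n:n + M][::-1] for n in range(N)])
y = X @ c_true + 0.1 * rng.standard_normal(N)        # desired response plus measurement noise

R_hat = X.T @ X / N          # time-averaged correlation matrix (real data, so ^H reduces to ^T)
d_hat = X.T @ y / N          # time-averaged cross-correlation vector
c_N = np.linalg.solve(R_hat, d_hat)

print(np.round(c_N, 3))
```

The estimate approaches `c_true` as the block length grows, but it reflects only the average statistics of the block, so any change of the SOE inside the block is smeared rather than tracked.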

Indeed, there are applications, for example, adaptive equalization, in which each input sample should be 
processed immediately after its observation and before the arrival of the next sample. In such cases, we should use a 
sample-by-sample adaptive filter that starts filtering immediately after the observation of the pair $\{\mathbf{x}(0), y(0)\}$ 
using a “guess” $\mathbf{c}(-1)$ for the adaptive filter coefficients. Usually, the initial guess $\mathbf{c}(-1)$ is a very poor estimate 
of the optimum filter $\mathbf{c}_o$. However, this estimate is improved with time as the filter processes additional pairs of 
observations. 

As we discussed in Section 9.2.1, an adaptive filter consists of three key modules: an adjustable filtering 
structure that uses input samples to compute the output, the criterion of performance that monitors the performance of 
the filter, and the adaptive algorithm that updates the filter coefficients. The key component of any adaptive filter is 
the adaptive algorithm, which is a rule to determine the filter coefficients from the available data $\mathbf{x}(n,\zeta)$ and 
$y(n,\zeta)$ [see Figure 9.10(b)]. The dependence of $\mathbf{c}(n,\zeta)$ on the input signal makes the adaptive filter a nonlinear 
and time-varying stochastic system. 

The data available to the adaptive filter at time $n$ are the input data vector $\mathbf{x}(n,\zeta)$, the desired response 
$y(n,\zeta)$, and the most recent update $\mathbf{c}(n-1,\zeta)$ of the coefficient vector. The adaptive filter, at each time $n$, 
performs the following computations: 

1. Filtering: 

$$\hat{y}(n,\zeta) = \mathbf{c}^H(n-1,\zeta)\,\mathbf{x}(n,\zeta) \tag{9.2.17}$$

2. Error formation: 

$$e(n,\zeta) = y(n,\zeta) - \hat{y}(n,\zeta) \tag{9.2.18}$$

3. Adaptive algorithm: 

$$\mathbf{c}(n,\zeta) = \mathbf{c}(n-1,\zeta) + \Delta\mathbf{c}\{\mathbf{x}(n,\zeta), e(n,\zeta)\} \tag{9.2.19}$$

where the increment or correction term $\Delta\mathbf{c}(n,\zeta)$ is chosen to bring $\mathbf{c}(n,\zeta)$ close to $\mathbf{c}_o$ with the passage of time. If 
we can successively determine the corrections $\Delta\mathbf{c}(n,\zeta)$ so that $\mathbf{c}(n,\zeta) \simeq \mathbf{c}_o$, that is, $\|\mathbf{c}(n,\zeta) - \mathbf{c}_o\| < \delta$, for some 
$n > N_\delta$, we obtain a good approximation for $\mathbf{c}_o$ by avoiding the explicit averagings (9.2.14), (9.2.15), and the solution 
of the normal equations (9.2.16). A key requirement is that $\Delta\mathbf{c}(n,\zeta)$ must vanish if the error $e(n,\zeta)$ vanishes. 
Hence, $e(n,\zeta)$ plays a major role in determining the increment $\Delta\mathbf{c}(n,\zeta)$. 

We notice that the estimate $\hat{y}(n,\zeta)$ of the desired response $y(n,\zeta)$ is evaluated using the current input 
vector $\mathbf{x}(n,\zeta)$ and the past filter coefficients $\mathbf{c}(n-1,\zeta)$. The estimate $\hat{y}(n,\zeta)$ and the corresponding error 
$e(n,\zeta)$ can be considered as predicted estimates compared to the actual estimates that would be evaluated using the 
current coefficient vector $\mathbf{c}(n,\zeta)$. Coefficient updating methods that use the predicted error $e(n,\zeta)$ are known as 
a priori type adaptive algorithms. 

If we use the actual estimates, obtained using the current estimate $\mathbf{c}(n,\zeta)$ of the adaptive filter coefficients, 
we have 

1. Filtering: 

$$\hat{y}_a(n,\zeta) = \mathbf{c}^H(n,\zeta)\,\mathbf{x}(n,\zeta) \tag{9.2.20}$$

2. Error formation: 

$$\varepsilon(n,\zeta) = y(n,\zeta) - \hat{y}_a(n,\zeta) \tag{9.2.21}$$

3. Adaptive algorithm: 

$$\mathbf{c}(n,\zeta) = \mathbf{c}(n-1,\zeta) + \Delta\mathbf{c}\{\mathbf{x}(n,\zeta), \varepsilon(n,\zeta)\} \tag{9.2.22}$$

which are known as a posteriori type adaptive algorithms. The terms a priori and a posteriori were introduced in 
Carayannis et al. (1983) to emphasize the use of estimates evaluated before or after the updating of the filter 
coefficients. The difference between a priori and a posteriori errors and their meanings will be further clarified when 
we discuss adaptive least-squares filters in Section 9.5. The timing diagram for the above two algorithms is shown in 
Figure 9.11. 
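
The three steps of the a priori recursion can be sketched as a loop. The text has not yet specified the correction term, so the one used below, proportional to the product of the data vector and the a priori error with a small step size `mu`, is only an assumed illustrative choice that vanishes when the error vanishes; the data model (`c_o`, white input, noise level) is likewise hypothetical, and all quantities are real-valued. The a posteriori error is evaluated after each update purely for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, mu = 3, 500, 0.05
c_o = np.array([1.0, -0.5, 0.25])            # hypothetical optimum filter to be found

c = np.zeros(M)                               # initial guess, playing the role of c(-1)
x_vec = np.zeros(M)                           # data vector, newest sample first
e_prior, e_post = np.zeros(N), np.zeros(N)

for n in range(N):
    # Shift in a new input sample and observe the desired response.
    x_vec = np.concatenate(([rng.standard_normal()], x_vec[:-1]))
    y = c_o @ x_vec + 0.01 * rng.standard_normal()

    e = y - c @ x_vec                         # steps 1-2: a priori error, uses the old coefficients
    c = c + mu * x_vec * e                    # step 3: assumed correction, zero when e is zero
    eps = y - c @ x_vec                       # a posteriori error, uses the updated coefficients

    e_prior[n], e_post[n] = e, eps

print(np.round(c, 2))
print(np.mean(np.abs(e_post) <= np.abs(e_prior) + 1e-12))
```

For this particular correction the a posteriori error is the a priori error scaled by one minus `mu` times the squared norm of the data vector, so for a small step size the updated coefficients fit the current observation at least as well as the old ones.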


FIGURE 9.11 
Timing diagrams for a priori and a posteriori adaptive algorithms. (Time axis from nT to (n+1)T, marking x(n), c(n-1), y(n), e(n), c(n), ε(n), x(n+1), and y(n+1).) 


In conclusion, the objective of an adaptive filter is to use the available data at time $n$, namely, 
$\{\mathbf{x}(n,\zeta), y(n,\zeta), \mathbf{c}(n-1,\zeta)\}$, to update the “old” coefficient vector $\mathbf{c}(n-1,\zeta)$ to a “new” estimate $\mathbf{c}(n,\zeta)$ so 
that $\mathbf{c}(n,\zeta)$ is closer to the optimum filter vector $\mathbf{c}_o(n)$ and the output $\hat{y}(n)$ is a better estimate of the desired 
response $y(n)$. Most adaptive algorithms have the following form: 

$$\left[\text{new coefficient vector}\right] = \left[\text{old coefficient vector}\right] + \left[\text{adaptation gain vector}\right] \cdot \left(\text{error signal}\right) \tag{9.2.23}$$

where the error signal is the difference between the desired response and the predicted or actual outputs of the 
adaptive filter. One of the fundamental differences among the various algorithms is the optimality of the used 
adaptation gain vector and the amount of computation required for its evaluation. 


9.2.3 Stability and Steady-State Performance of Adaptive Filters 


We now address the issues of stability and performance of adaptive filters. Since the goal of an adaptive filter 
$\mathbf{c}(n,\zeta)$ is first to find and then track the optimum filter $\mathbf{c}_o(n)$ as quickly and accurately as possible, we can 
evaluate its performance by measuring some function of its deviation 

$$\tilde{\mathbf{c}}(n,\zeta) = \mathbf{c}(n,\zeta) - \mathbf{c}_o(n) \tag{9.2.24}$$


from the corresponding optimum filter. Clearly, an acceptable adaptive filter should be stable in the bounded-input 
bounded-output (BIBO) sense, and its performance should be close to that of the associated optimum filter. The 
analysis of BIBO stability is extremely difficult because adaptive filters are nonlinear, time-varying systems working 
in a random SOE. The performance of adaptive filters is primarily measured by investigating the value of the MSE as 
a function of time. To discuss these problems, first we consider an adaptive filter working in a stationary SOE, and 
then we extend our discussion to a nonstationary SOE. 


Stability 

The adaptive filter starts its operation at time, say, $n = 0$, and by processing the observations
$\{\mathbf{x}(n,\zeta), y(n,\zeta)\}_0^\infty$ generates a sequence of vectors $\{\mathbf{c}(n,\zeta)\}_0^\infty$ using the adaptive algorithm. Since the FIR
filtering structure is always stable, the output or the error of the adaptive filter will be bounded if its coefficients are
always kept close to the coefficients of the associated optimum filter. However, the presence of the feedback loop in
every adaptive filter (see Figure 9.10) raises the issue of stability. In a stationary SOE, where the optimum filter $\mathbf{c}_o$ is
constant, convergence of $\mathbf{c}(n,\zeta)$ to $\mathbf{c}_o$ as $n \to \infty$ will guarantee the BIBO stability of the adaptive filter. For a
specific realization $\zeta$, the $k$th component $c_k(n,\zeta)$ or the norm $\|\mathbf{c}(n,\zeta)\|$ of the vector $\mathbf{c}(n,\zeta)$ is a sequence of
numbers that might or might not converge. Since the coefficients $c_k(n,\zeta)$ are random, we must use the concept
of stochastic convergence (Papoulis 1991).

We say that a random sequence converges everywhere if the sequence $c_k(n,\zeta)$ converges for every $\zeta$, that
is,

$$\lim_{n\to\infty} c_k(n,\zeta) = c_{o,k}(\zeta) \tag{9.2.25}$$

where the limit $c_{o,k}(\zeta)$ depends, in general, on $\zeta$. Requiring the adaptive filter to converge to $\mathbf{c}_o$ for every
possible realization of the SOE is both hard to guarantee and not necessary, because some realizations may have very
small or zero probability of occurrence.

If we wish to ensure that the adaptive filter converges for the realizations of the SOE that may actually occur, we
can use the concept of convergence almost everywhere. We say that the random sequence $c_k(n,\zeta)$ converges
almost everywhere or with probability 1 if

$$P\{\lim_{n\to\infty} |c_k(n,\zeta) - c_{o,k}(\zeta)| = 0\} = 1 \tag{9.2.26}$$


which implies that there can be some sample sequences that do not converge, which must occur with probability zero. 
Another type of stochastic convergence that is used in adaptive filtering is defined by 


$$\lim_{n\to\infty} E\{|c_k(n,\zeta) - c_{o,k}|^2\} = \lim_{n\to\infty} E\{|\tilde{c}_k(n,\zeta)|^2\} = 0 \tag{9.2.27}$$


and is known as convergence in the MS sense. The primary reason for the use of mean square (MS) convergence is 
that unlike the almost-everywhere convergence, it uses only one sequence of numbers that takes into account the 
averaging effect of all sample sequences. Furthermore, it uses second-order moments for verification and has an 
interpretation in terms of power. Convergence in MS does not imply—nor is implied by—convergence with 
probability 1. Since 


$$E\{|\tilde{c}_k(n,\zeta)|^2\} = |E\{\tilde{c}_k(n,\zeta)\}|^2 + \operatorname{var}\{\tilde{c}_k(n,\zeta)\} \tag{9.2.28}$$

if we can show that $E\{\tilde{c}_k(n,\zeta)\} \to 0$ as $n \to \infty$ and $\operatorname{var}\{\tilde{c}_k(n,\zeta)\}$ is bounded for all $n$, we can ensure
convergence in MS. In this case, we can say that an adaptive filter that operates in a stationary SOE is an
asymptotically stable filter.
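The decomposition (9.2.28) can be checked in a few lines. The following is a short derivation sketch in which $m_k$ is a shorthand introduced here only for convenience (it is not a symbol used elsewhere in the text) for the mean $E\{\tilde{c}_k(n,\zeta)\}$, and the arguments $(n,\zeta)$ are dropped for brevity:

```latex
\begin{aligned}
E\{|\tilde{c}_k|^2\}
  &= E\{|(\tilde{c}_k - m_k) + m_k|^2\} \\
  &= E\{|\tilde{c}_k - m_k|^2\} + |m_k|^2
     + 2\,\mathrm{Re}\{m_k^{*}\,E\{\tilde{c}_k - m_k\}\} \\
  &= \operatorname{var}\{\tilde{c}_k\} + |m_k|^2
     \qquad \text{since } E\{\tilde{c}_k - m_k\} = 0 .
\end{aligned}
```

The first term is the fluctuation of the coefficient about its mean, and the second is the squared bias of the mean with respect to the optimum coefficient.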


Performance measures 

In theoretical investigations, any quantity that measures the deviation of an adaptive filter from the 
corresponding optimum filter can be used to evaluate its performance. 
The mean square deviation (MSD)

$$\mathcal{D}(n) \triangleq E\{\|\mathbf{c}(n,\zeta) - \mathbf{c}_o(n)\|^2\} = E\{\|\tilde{\mathbf{c}}(n,\zeta)\|^2\} \tag{9.2.29}$$

measures the average distance between the coefficient vectors of the adaptive and optimum filters. Although the
MSD is not measurable in practice, it is useful in analytical studies. Adaptive algorithms that minimize $\mathcal{D}(n)$ for
each value of $n$ are known as algorithms with optimum learning.

In Section 5.2.2 we showed that if the input correlation matrix is positive definite, any deviation, say, $\tilde{\mathbf{c}}(n)$, of
the filter coefficients from their optimum setting increases the mean square error (MSE) by an amount equal
to $\tilde{\mathbf{c}}^H(n)\mathbf{R}\tilde{\mathbf{c}}(n)$, known as excess MSE (EMSE). In adaptive filters, the random deviation $\tilde{\mathbf{c}}(n,\zeta)$ from the
optimum results in an EMSE, which is measured by the ensemble average of $\tilde{\mathbf{c}}^H(n,\zeta)\mathbf{R}\tilde{\mathbf{c}}(n,\zeta)$. For a posteriori
adaptive filters, the MSE can be decomposed as


$$P(n) \triangleq E\{|\varepsilon(n,\zeta)|^2\} \triangleq P_o(n) + P_{ex}(n) \tag{9.2.30}$$

where $P_{ex}(n)$ is the EMSE and $P_o(n)$ is the MMSE given by

$$P_o(n) \triangleq E\{|\varepsilon_o(n,\zeta)|^2\} \tag{9.2.31}$$

with

$$\varepsilon_o(n,\zeta) \triangleq y(n,\zeta) - \mathbf{c}_o^H(n)\mathbf{x}(n,\zeta) \tag{9.2.32}$$

as the a posteriori optimum filtering error. Clearly, the a posteriori EMSE $P_{ex}(n)$ is given by

$$P_{ex}(n) \triangleq P(n) - P_o(n) \tag{9.2.33}$$

Footnote: We recall that a sequence of real nonrandom numbers $a_0, a_1, a_2, \ldots$ converges to a number $a$ if and only if for every positive number
$\delta$ there exists a positive integer $N_\delta$ such that for all $n > N_\delta$ we have $|a_n - a| < \delta$. This is abbreviated by $\lim_{n\to\infty} a_n = a$.


For a priori adaptive algorithms, where we use the "old" coefficient vector $\mathbf{c}(n-1,\zeta)$, it is more appropriate to use
the a priori EMSE given by

$$P_{ex}(n) \triangleq P(n) - P_o(n) \tag{9.2.34}$$

where

$$P(n) \triangleq E\{|e(n,\zeta)|^2\} \tag{9.2.35}$$

and

$$P_o(n) \triangleq E\{|e_o(n,\zeta)|^2\} \tag{9.2.36}$$

with

$$e_o(n,\zeta) \triangleq y(n,\zeta) - \mathbf{c}_o^H(n-1)\mathbf{x}(n,\zeta) \tag{9.2.37}$$

as the a priori optimum filtering error. If the SOE is stationary, we have $\varepsilon_o(n,\zeta) = e_o(n,\zeta)$; that is, the optimum a
priori and a posteriori errors are identical.
The dimensionless ratio

$$\mathcal{M}(n) \triangleq \frac{P_{ex}(n)}{P_o(n)} \tag{9.2.38}$$

known as misadjustment, is a useful measure of the quality of adaptation. Since the EMSE is always positive, there is
no adaptive filter that can perform (on the average) better than the corresponding optimum filter. In this sense,
we can say that the excess MSE or the misadjustment measures the cost of adaptation.
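To make the EMSE and misadjustment definitions concrete, the following minimal sketch evaluates the quadratic form $\tilde{\mathbf{c}}^T\mathbf{R}\tilde{\mathbf{c}}$ for a single real-valued deviation and divides it by an assumed MMSE. The function names and all numerical values are hypothetical and serve only as an illustration; in the text the EMSE is the ensemble average of this quantity, which would require averaging over many realizations.

```python
def excess_mse(c_err, R):
    """Quadratic form c_err^T R c_err for a real-valued coefficient deviation."""
    M = len(c_err)
    return sum(c_err[i] * R[i][j] * c_err[j] for i in range(M) for j in range(M))


def misadjustment(p_ex, p_o):
    """Dimensionless ratio of excess MSE to minimum MSE."""
    if p_o <= 0:
        raise ValueError("the MMSE must be positive")
    return p_ex / p_o


# Hypothetical values, chosen only for illustration.
R = [[1.0, 0.1], [0.1, 1.0]]   # input correlation matrix (positive definite)
c_err = [0.05, -0.02]          # one deviation c(n) - c_o
p_o = 0.1                      # assumed MMSE

p_ex = excess_mse(c_err, R)
print(p_ex, misadjustment(p_ex, p_o))
```

Because R is positive definite, the quadratic form is positive for any nonzero deviation, which is the reason the adaptive filter cannot outperform the optimum filter on average.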


Acquisition and tracking 

Plots of the MSD, MSE, or $\mathcal{M}(n)$ as a function of $n$, which are known as learning curves, characterize the
performance of an adaptive filter and are widely used in theoretical and experimental studies. When the adaptive filter
starts its operation, its coefficients provide a poor estimate of the optimum filter and the MSD or the MSE is very
large. As the number of observations processed by the adaptive filter increases with time, we expect the quality of the
estimate $\mathbf{c}(n,\zeta)$ to improve, and therefore the MSD and the MSE to decrease. The property of an adaptive filter to
bring the coefficient vector $\mathbf{c}(n,\zeta)$ close to the optimum filter $\mathbf{c}_o$, independently of the initial condition $\mathbf{c}(-1)$
and the statistical properties of the SOE, is called acquisition. During the acquisition phase, we say that the adaptive
filter is in a transient mode of operation.

A natural requirement for any adaptive algorithm is that adaptation stops after the algorithm has found the
optimum filter $\mathbf{c}_o$. However, owing to the randomness of the SOE and the finite amount of data used by the adaptive
filter, its coefficients continuously fluctuate about their optimum settings, that is, about the coefficients of the
optimum filter, in a random manner. As a result, the adaptive filter reaches a steady-state mode of operation after a
certain time, and its performance stops improving.

The transient and steady-state modes of operation in a stationary SOE are illustrated in Figure 9.12(a). The
duration of the acquisition phase characterizes the speed of adaptation or rate of convergence of the adaptive filter,
whereas the steady-state EMSE or misadjustment characterizes the quality of adaptation. These properties depend on
the SOE, the filtering structure, and the adaptive algorithm.

At each time $n$, any adaptive filter computes an estimate of the optimum filter using a finite amount of data.
The error resulting from the finite amount of data is known as estimation error. An additional error, known as the lag
error, results when the adaptive filter attempts to track a time-varying optimum filter $\mathbf{c}_o(n)$ in a nonstationary SOE.
The modes of operation of an adaptive filter in a nonstationary SOE are illustrated in Figure 9.12(b). The SOE of the
adaptive filter becomes nonstationary if $\mathbf{x}(n,\zeta)$ or $y(n,\zeta)$ or both are nonstationary. The nonstationarity of the
input is more severe than that of the desired response because it may affect the invertibility of $\mathbf{R}(n)$. Since the
adaptive filter has to first acquire and then track the optimum filter, tracking is a steady-state property. Therefore, in
general, the speed of adaptation (a transient-phase property) and the tracking capability (a steady-state property) are
two different characteristics of the adaptive filter. Clearly, tracking is feasible only if the statistics of the SOE change
"slowly" compared to the speed of tracking of the adaptive filter. These concepts will become more precise in Section
9.8, where we discuss the tracking properties of adaptive filters.

FIGURE 9.12
Modes of operation in a stationary and nonstationary SOE.


9.2.4 Some Practical Considerations 


The complexity of the hardware or software implementation of an adaptive filter is basically determined by the 
following factors: (1) the number of instructions per time update or computing time required to complete one time 
updating; (2) the number of memory locations required to store the data and the program instructions; (3) the 
structure of information flow in the algorithm, which is very important for implementations using parallel processing, 
systolic arrays, or VLSI chips; and (4) the investment in hardware design tools and software development. We focus 
on implementations for general-purpose computers or special-purpose digital signal processors that basically involve 
programming in a high level or assembly language. More details about DSP software development can be found in 
Embree and Kimble (1991) and in Lapsley et al. (1997). 

The digital implementation of adaptive filters implies the use of finite-word-length arithmetic. As a result, the 
performance of the practical (finite-precision) adaptive filters deviates from the performance of ideal 
(infinite-precision) adaptive filters. Finite-precision implementation affects the performance of adaptive filters in 
several complicated ways. The major factors are (1) the quantization of the input signal(s) and the desired response, 
(2) the quantization of filter coefficients, and (3) the roundoff error in the arithmetic operations used to implement the 
adaptive filter. The nonlinear nature of adaptive filters coupled with the nonlinearities introduced by the 
finite-word-length arithmetic makes the performance evaluation of practical adaptive filters extremely difficult. 
Although theoretical analysis provides insight and helps to clarify the behavior of adaptive filters, the most effective 
way is to simulate the filter and measure its performance. 

Finite precision affects two important properties of adaptive filters, which, although related, are not equivalent.
Let us denote by $\mathbf{c}_{ip}(n)$ and $\mathbf{c}_{fp}(n)$ the coefficient vectors of the filter implemented using infinite- and
finite-precision arithmetic, respectively. An adaptive filter is said to be numerically stable if the difference vector
$\mathbf{c}_{ip}(n) - \mathbf{c}_{fp}(n)$ always remains bounded, that is, the roundoff error propagation system is stable. Numerical stability
is an inherent property of the adaptive algorithm and cannot be altered by increasing the numerical precision. Indeed,
increasing the word length or reorganizing the computations will simply delay the divergence of an adaptive filter;
only an actual change of the algorithm can stabilize an adaptive filter by improving the properties of the roundoff error
propagation system (Ljung and Ljung 1985; Cioffi 1987).

The numerical accuracy of an adaptive filter measures the deviation, at steady state, of any obtained estimates
from theoretically expected values, due to roundoff errors. Limited numerical accuracy increases the output
error without causing catastrophic problems, and this error can be reduced by increasing the word length. In contrast, lack of numerical
stability leads to catastrophic overflow (divergence or blowup of the algorithm) as a result of roundoff error
accumulation. Numerically unstable algorithms that converge before "explosion" may provide good numerical accuracy.
Therefore, although the two properties are related, one does not imply the other.

Two other important issues are the sensitivity of an algorithm to bad or abnormal input data (e.g., poorly
exciting input) and its sensitivity to initialization. All these issues are very important for the application of adaptive
algorithms to real-world problems and are further discussed in the context of specific algorithms.


9.3 Method of Steepest Descent 


Most adaptive filtering algorithms are obtained by simple modifications of iterative methods for solving deterministic 
optimization problems. Studying these techniques helps one to understand several aspects of the operation of adaptive 
filters. In this section we discuss gradient-based optimization methods because they provide the ground for the 
development of the most widely used adaptive filtering algorithms. 

As we discussed in Section 5.2.1, the error performance surface of an optimum filter, in a stationary SOE, is
given by

$$P(\mathbf{c}) = P_y - \mathbf{c}^H\mathbf{d} - \mathbf{d}^H\mathbf{c} + \mathbf{c}^H\mathbf{R}\mathbf{c} \tag{9.3.1}$$

where $P_y = E\{|y(n)|^2\}$. Equation (9.3.1) is a quadratic function of the coefficients; when $\mathbf{R}$ is positive definite, it represents a bowl-shaped
surface with a unique minimum at $\mathbf{c}_o$ (the optimum filter). There are two distinct ways to
find the minimum of (9.3.1):


1. Solve the normal equations Rc =d , using a direct linear system solution method. 
2. Find the minimum of P(c), using an iterative minimization algorithm. 


Although direct methods provide the solution in a finite number of steps, sometimes we prefer iterative methods 
because they require less numerical precision, are computationally less expensive, work when R is not invertible, and 
are the only choice for nonquadratic performance functions. 
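As an illustration of the first (direct) approach, the sketch below evaluates the real-valued form of the error surface (9.3.1), $P(\mathbf{c}) = P_y - 2\mathbf{c}^T\mathbf{d} + \mathbf{c}^T\mathbf{R}\mathbf{c}$, and solves the 2-by-2 normal equations with Cramer's rule. The helper names and the numerical values of $\mathbf{R}$, $\mathbf{d}$, and $P_y$ are assumptions made for this example only.

```python
def mse_surface(c, R, d, p_y):
    """P(c) = P_y - 2 c^T d + c^T R c for real-valued c, R, and d."""
    M = len(c)
    quad = sum(c[i] * R[i][j] * c[j] for i in range(M) for j in range(M))
    lin = sum(c[i] * d[i] for i in range(M))
    return p_y - 2.0 * lin + quad


def solve_2x2(A, b):
    """Solve the 2-by-2 linear system A c = b with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0.0:
        raise ValueError("matrix is singular; a direct solution does not exist")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


# Hypothetical second-order moments, chosen only for illustration.
R = [[1.0, 0.5], [0.5, 1.0]]
d = [0.5, 0.25]
p_y = 1.0

c_o = solve_2x2(R, d)                 # direct solution of R c = d
p_min = mse_surface(c_o, R, d, p_y)   # value at the bottom of the bowl
print(c_o, p_min)
```

Any other coefficient vector gives a larger value of the surface, which can be checked by evaluating `mse_surface` at a perturbed point.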

In all iterative methods, we start with an approximate solution (a guess), which we keep changing until we reach
the minimum. Thus, to find the optimum $\mathbf{c}_o$, we start at some arbitrary point $\mathbf{c}_0$, usually the null vector $\mathbf{c}_0 = \mathbf{0}$, and then
start a search for the "bottom of the bowl." The key is to choose the steps in a systematic way so that each step takes
us to a lower point until finally we reach the bottom. What differentiates various optimization algorithms is how we
choose the direction and the size of each step.


Steepest-descent algorithm (SDA) 
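If the function $P(\mathbf{c})$ has continuous derivatives, it is possible to approximate its value at an arbitrary
neighboring point $\mathbf{c} + \Delta\mathbf{c}$ by using the Taylor expansion

$$P(\mathbf{c} + \Delta\mathbf{c}) = P(\mathbf{c}) + \sum_{i=1}^{M} \frac{\partial P(\mathbf{c})}{\partial c_i}\,\Delta c_i + \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M} \Delta c_i\,\frac{\partial^2 P(\mathbf{c})}{\partial c_i\,\partial c_j}\,\Delta c_j + \cdots \tag{9.3.2}$$

or more compactly

$$P(\mathbf{c} + \Delta\mathbf{c}) = P(\mathbf{c}) + (\Delta\mathbf{c})^T \nabla P(\mathbf{c}) + \frac{1}{2}(\Delta\mathbf{c})^T[\nabla^2 P(\mathbf{c})](\Delta\mathbf{c}) + \cdots \tag{9.3.3}$$

where $\nabla P(\mathbf{c})$ is the gradient vector, with elements $\partial P(\mathbf{c})/\partial c_i$, and $\nabla^2 P(\mathbf{c})$ is the Hessian matrix, with elements
$\partial^2 P(\mathbf{c})/(\partial c_i\,\partial c_j)$. For simplicity we consider filters with real coefficients, but the conclusions apply when the
coefficients are complex. For the quadratic function (9.3.1), we have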


$$\nabla P(\mathbf{c}) = 2(\mathbf{R}\mathbf{c} - \mathbf{d}) \tag{9.3.4}$$

$$\nabla^2 P(\mathbf{c}) = 2\mathbf{R} \tag{9.3.5}$$


and the higher-order terms are zero. For nonquadratic functions, higher-order terms are nonzero, but if $\|\Delta\mathbf{c}\|$ is small,
we can use a quadratic approximation. We note that if $\nabla P(\mathbf{c}_o) = \mathbf{0}$ and $\mathbf{R}$ is positive definite, then $\mathbf{c}_o$ is the
minimum because $(\Delta\mathbf{c})^T[\nabla^2 P(\mathbf{c}_o)](\Delta\mathbf{c}) > 0$ for any nonzero $\Delta\mathbf{c}$. Hence, if we choose a sufficiently small step $\Delta\mathbf{c}$ so that
$(\Delta\mathbf{c})^T\nabla P(\mathbf{c}) < 0$, we will have $P(\mathbf{c} + \Delta\mathbf{c}) < P(\mathbf{c})$, that is, we make a step to a point closer to the minimum. Since
$(\Delta\mathbf{c})^T\nabla P(\mathbf{c}) = \|\Delta\mathbf{c}\|\,\|\nabla P(\mathbf{c})\|\cos\theta$, the reduction in MSE for a given step length is maximum when $\cos\theta = -1$, that is, when $\Delta\mathbf{c}$ points along $-\nabla P(\mathbf{c})$. For this reason, the
direction of the negative gradient is known as the direction of steepest descent. This leads to the following iterative
minimization algorithm

$$\mathbf{c}_k = \mathbf{c}_{k-1} + \mu[-\nabla P(\mathbf{c}_{k-1})] \qquad k \ge 0 \tag{9.3.6}$$

which is known as the method of steepest descent (Scales 1985). The positive constant $\mu$, known as the step-size
parameter, controls the size of the descent in the direction of the negative gradient. The algorithm is usually
initialized with $\mathbf{c}_0 = \mathbf{0}$. The steepest-descent algorithm (SDA) is illustrated in Figure 9.13 for a single-parameter
case.


FIGURE 9.13
Illustration of gradient search of the MSE surface $P(\mathbf{c})$ for the minimum error point.


For the cost function in (9.3.1), the SDA becomes

$$\mathbf{c}_k = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1}) = (\mathbf{I} - 2\mu\mathbf{R})\mathbf{c}_{k-1} + 2\mu\mathbf{d} \tag{9.3.7}$$

which is a recursive difference equation. Note that $k$ denotes an iteration in the SDA and has nothing to do with time.
However, this iterative optimization can be combined with filtering to obtain a type of "asymptotically" optimum
filter defined by

$$e(n,\zeta) = y(n,\zeta) - \mathbf{c}_{n-1}^H\mathbf{x}(n,\zeta) \tag{9.3.8}$$

$$\mathbf{c}_n = \mathbf{c}_{n-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{n-1}) \tag{9.3.9}$$

and is further discussed in Problem 9.2.
There are two key performance factors in the design of iterative optimization algorithms: stability and rate of 
convergence. 
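The recursion (9.3.7) is short enough to try directly. The following is a minimal sketch for real-valued quantities, with a hypothetical function name and hypothetical second-order moments; it is not the implementation used for the experiments in the text.

```python
def steepest_descent(R, d, mu, num_iter, c0=None):
    """Iterate c_k = c_{k-1} + 2*mu*(d - R c_{k-1}) for real-valued R and d.

    Returns the list [c_0, c_1, ..., c_num_iter].
    """
    M = len(d)
    c = list(c0) if c0 is not None else [0.0] * M   # usual start: null vector
    history = [list(c)]
    for _ in range(num_iter):
        Rc = [sum(R[i][j] * c[j] for j in range(M)) for i in range(M)]
        c = [c[i] + 2.0 * mu * (d[i] - Rc[i]) for i in range(M)]
        history.append(list(c))
    return history


# Hypothetical second-order moments; the solution of R c = d is [0.5, 0.0].
R = [[1.0, 0.5], [0.5, 1.0]]
d = [0.5, 0.25]
history = steepest_descent(R, d, mu=0.2, num_iter=200)
print(history[1], history[-1])
```

The iteration uses only the moments $\mathbf{R}$ and $\mathbf{d}$, never the data, which is why the index $k$ is an iteration count rather than a time index.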


Stability 

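An algorithm is said to be stable if it converges to the minimum regardless of the starting point. To investigate
the stability of the SDA, we rewrite (9.3.7) in terms of the coefficient error vector

$$\tilde{\mathbf{c}}_k \triangleq \mathbf{c}_k - \mathbf{c}_o \qquad k \ge 0 \tag{9.3.10}$$

as

$$\tilde{\mathbf{c}}_k = (\mathbf{I} - 2\mu\mathbf{R})\tilde{\mathbf{c}}_{k-1} \qquad k \ge 0 \tag{9.3.11}$$

which is a homogeneous difference equation. Using the principal-components transformation $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H$, we can
write (9.3.11) as

$$\tilde{\mathbf{c}}'_k = (\mathbf{I} - 2\mu\boldsymbol{\Lambda})\tilde{\mathbf{c}}'_{k-1} \qquad k \ge 0 \tag{9.3.12}$$

where

$$\tilde{\mathbf{c}}'_k \triangleq \mathbf{Q}^H\tilde{\mathbf{c}}_k \qquad k \ge 0 \tag{9.3.13}$$

is the transformed coefficient error vector. Since $\boldsymbol{\Lambda}$ is diagonal, (9.3.12) consists of a set of $M$ decoupled first-order
difference equations

$$\tilde{c}'_{k,i} = (1 - 2\mu\lambda_i)\tilde{c}'_{k-1,i} \qquad i = 1, 2, \ldots, M, \quad k \ge 0 \tag{9.3.14}$$

with each describing a natural mode of the SDA. The solutions of (9.3.12) are given by

$$\tilde{c}'_{k,i} = (1 - 2\mu\lambda_i)^k\,\tilde{c}'_{0,i} \qquad k \ge 0 \tag{9.3.15}$$

If for all $1 \le i \le M$

$$-1 < 1 - 2\mu\lambda_i < 1 \tag{9.3.16}$$

or equivalently

$$0 < \mu < \frac{1}{\lambda_i} \tag{9.3.17}$$

then $\tilde{c}'_{k,i}$, $1 \le i \le M$, tends to zero as $k \to \infty$. This implies that $\mathbf{c}_k$ converges exponentially to $\mathbf{c}_o$ as $k \to \infty$
because $\|\tilde{\mathbf{c}}_k\| = \|\mathbf{Q}\tilde{\mathbf{c}}'_k\| = \|\tilde{\mathbf{c}}'_k\|$. If $\mathbf{R}$ is positive definite, its eigenvalues are positive and

$$0 < \mu < \frac{1}{\lambda_{\max}} \tag{9.3.18}$$

provides a necessary and sufficient condition for the convergence of SDA.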
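The step-size bound (9.3.18) can be checked numerically by computing the per-iteration factors $1 - 2\mu\lambda_i$ of the natural modes. The sketch below does this for a real symmetric 2-by-2 correlation matrix, using the closed-form eigenvalues of such a matrix; the function names and the matrix are illustrative assumptions.

```python
import math


def eigenvalues_sym_2x2(R):
    """Eigenvalues (largest first) of a real symmetric 2-by-2 matrix."""
    a, b, c = R[0][0], R[0][1], R[1][1]
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean + radius, mean - radius


def max_stable_step(R):
    """Upper bound 1/lambda_max on the SDA step size."""
    lam_max, _ = eigenvalues_sym_2x2(R)
    return 1.0 / lam_max


def mode_factors(R, mu):
    """Per-iteration factors (1 - 2*mu*lambda_i) of the natural modes."""
    return [1.0 - 2.0 * mu * lam for lam in eigenvalues_sym_2x2(R)]


R = [[1.0, 0.5], [0.5, 1.0]]     # hypothetical matrix with eigenvalues 1.5 and 0.5
print(max_stable_step(R))
for mu in (0.6, 0.7):            # one step size below and one above the bound
    factors = mode_factors(R, mu)
    print(mu, factors, all(abs(f) < 1.0 for f in factors))
```

For the step size above the bound, the factor of the mode associated with the largest eigenvalue has magnitude greater than 1, so that mode grows instead of decaying.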
To investigate the transient behavior of the SDA as a function of $k$, we note that using (9.3.10), (9.3.11), and
(9.3.14), we have

$$c_{k,i} = c_{o,i} + \sum_{j=1}^{M} q_{ij}\,\tilde{c}'_{0,j}\,(1 - 2\mu\lambda_j)^k \tag{9.3.19}$$

where $c_{o,i}$ are the optimum coefficients and $q_{ij}$ the elements of the eigenvector matrix $\mathbf{Q}$. The MSE at step $k$ is

$$P_k = P_o + \sum_{i=1}^{M} \lambda_i (1 - 2\mu\lambda_i)^{2k}\,|\tilde{c}'_{0,i}|^2 \tag{9.3.20}$$

and can be obtained by substituting (9.3.19) in (9.3.1). If $\mu$ satisfies (9.3.18), we have $\lim_{k\to\infty} P_k = P_o$ and the MSE
converges exponentially to the optimum value. The curve obtained by plotting the MSE $P_k$ as a function of the
number of iterations $k$ is known as the learning curve.


Rate of convergence 

The rate (or speed) of convergence depends upon the algorithm and the nature of the performance surface. The
most influential factor is the condition number of the Hessian matrix, which determines the shape of the
contours of $P(\mathbf{c})$. When $P(\mathbf{c})$ is quadratic, it can be shown (Luenberger 1984) that

$$P(\mathbf{c}_k) \le \left[\frac{\chi(\mathbf{R}) - 1}{\chi(\mathbf{R}) + 1}\right]^2 P(\mathbf{c}_{k-1}) \tag{9.3.21}$$

where $\chi(\mathbf{R}) = \lambda_{\max}/\lambda_{\min}$ is the condition number of $\mathbf{R}$. If we recall that the eigenvectors corresponding to $\lambda_{\min}$
and $\lambda_{\max}$ point to the directions of minimum and maximum curvature, respectively, we see that the convergence
slows down as the contours become more eccentric (flattened). For circular contours, that is, when $\chi(\mathbf{R}) = 1$, the
algorithm converges in one step. We stress that even if $M - 1$ eigenvalues of $\mathbf{R}$ are equal and the remaining one
is far away, the convergence of the SDA is still very slow.

The rate of convergence can be characterized by using the time constant $\tau_i$ defined by

$$1 - 2\mu\lambda_i = \exp\left(-\frac{1}{\tau_i}\right) \tag{9.3.22}$$

which provides the time (or number of iterations) it takes for the $i$th mode $c_{k,i}$ of (9.3.19) to decay to $1/e$ of its
initial value $c_{0,i}$. When $\mu \ll 1$, we obtain

$$\tau_i \simeq \frac{1}{2\mu\lambda_i} \tag{9.3.23}$$

In a similar fashion, the time constant $\tau_{i,\mathrm{mse}}$ for the MSE $P_k$ can be shown to be

$$\tau_{i,\mathrm{mse}} \simeq \frac{1}{4\mu\lambda_i} \tag{9.3.24}$$

by using (9.3.20) and (9.3.22).

Thus, for all practical purposes, the time constant (for coefficient $\mathbf{c}_k$ or for MSE $P_k$) of the SDA is
$\tau \simeq 1/(\mu\lambda_{\min})$, which in conjunction with $\mu < 1/\lambda_{\max}$ results in $\tau > \lambda_{\max}/\lambda_{\min}$. Hence, the larger the eigenvalue
spread of the input correlation matrix $\mathbf{R}$, the longer it takes for the SDA to converge.
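To see how good the small-step approximation (9.3.23) is, the sketch below compares it with the exact time constant obtained by solving (9.3.22) for $\tau_i$. The function names are hypothetical, and the exact expression is restricted here to the overdamped case $0 < 1 - 2\mu\lambda_i < 1$, where the logarithm is defined.

```python
import math


def time_constant_exact(mu, lam):
    """Solve 1 - 2*mu*lam = exp(-1/tau) for tau (overdamped modes only)."""
    factor = 1.0 - 2.0 * mu * lam
    if not 0.0 < factor < 1.0:
        raise ValueError("requires 0 < 1 - 2*mu*lam < 1")
    return -1.0 / math.log(factor)


def time_constant_approx(mu, lam):
    """Small-step approximation tau ~ 1/(2*mu*lam)."""
    return 1.0 / (2.0 * mu * lam)


mu = 0.15
for lam in (1.1, 0.9, 1.818, 0.182):   # eigenvalues taken from Table 9.2
    print(lam, time_constant_exact(mu, lam), time_constant_approx(mu, lam))
```

The approximation always overestimates the exact value slightly, and the mode with the smallest eigenvalue has by far the longest time constant, in line with the conclusion that it dominates the convergence time.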

In the following example, we illustrate the properties of the SDA discussed above by using it to compute the
parameters of a second-order forward linear predictor.


EXAMPLE 9.3.1. Consider a signal generated by the second-order autoregressive AR(2) process

$$x(n) + a_1 x(n-1) + a_2 x(n-2) = w(n) \tag{9.3.25}$$

where $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$. Parameters $a_1$ and $a_2$ are chosen so that the system (9.3.25) is minimum-phase. We want to design
an adaptive filter that uses the samples $x(n-1)$ and $x(n-2)$ to predict the value $x(n)$ (desired response).
If we multiply (9.3.25) by $x(n-k)$, for $k = 0, 1, 2$, and take the mathematical expectation of both sides, we obtain a set
of linear equations

$$r(0) + a_1 r(1) + a_2 r(2) = \sigma_w^2 \tag{9.3.26}$$

$$r(1) + a_1 r(0) + a_2 r(1) = 0 \tag{9.3.27}$$

$$r(2) + a_1 r(1) + a_2 r(0) = 0 \tag{9.3.28}$$

which can be used to express the autocorrelation of $x(n)$ in terms of model parameters $a_1$, $a_2$, and $\sigma_w^2$. Indeed, solving (9.3.26)
through (9.3.28), we obtain

$$\begin{aligned}
r(0) &= \sigma_x^2 = \left(\frac{1 + a_2}{1 - a_2}\right)\frac{\sigma_w^2}{(1 + a_2)^2 - a_1^2} \\
r(1) &= \frac{-a_1}{1 + a_2}\,r(0) \\
r(2) &= \left(-a_2 + \frac{a_1^2}{1 + a_2}\right) r(0)
\end{aligned} \tag{9.3.29}$$

We choose $\sigma_x^2 = 1$, so that

$$\sigma_w^2 = \frac{(1 - a_2)\,[(1 + a_2)^2 - a_1^2]}{1 + a_2} \tag{9.3.30}$$
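The closed-form expressions (9.3.29) and (9.3.30) are easy to check numerically. The following sketch, with hypothetical function names, computes the autocorrelation values for given AR(2) parameters and verifies that they satisfy (9.3.26) through (9.3.28); the parameter values are those of the small-spread case in Table 9.2.

```python
def noise_variance_for_unit_power(a1, a2):
    """sigma_w^2 that yields sigma_x^2 = r(0) = 1, from (9.3.30)."""
    return (1.0 - a2) * ((1.0 + a2) ** 2 - a1 ** 2) / (1.0 + a2)


def ar2_autocorrelation(a1, a2, var_w):
    """Return r(0), r(1), r(2) of the AR(2) process, from (9.3.29)."""
    r0 = (1.0 + a2) / (1.0 - a2) * var_w / ((1.0 + a2) ** 2 - a1 ** 2)
    r1 = -a1 / (1.0 + a2) * r0
    r2 = (-a2 + a1 ** 2 / (1.0 + a2)) * r0
    return r0, r1, r2


a1, a2 = -0.1950, 0.95                       # small-spread case of Table 9.2
var_w = noise_variance_for_unit_power(a1, a2)
r0, r1, r2 = ar2_autocorrelation(a1, a2, var_w)

# Residuals of (9.3.26) through (9.3.28); all three should be close to zero.
residuals = (r0 + a1 * r1 + a2 * r2 - var_w,
             r1 + a1 * r0 + a2 * r1,
             r2 + a1 * r1 + a2 * r0)
print(var_w, (r0, r1, r2), residuals)
```

The computed noise variance agrees with the corresponding entry of Table 9.2 to the number of digits printed there.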
The coefficients of the optimum predictor

$$\hat{y}(n) \triangleq \hat{x}(n) = c_{o,1}\,x(n-1) + c_{o,2}\,x(n-2) \tag{9.3.31}$$

are given by (see Section 6.5)

$$r(0)c_{o,1} + r(1)c_{o,2} = r(1) \tag{9.3.32}$$

$$r(1)c_{o,1} + r(0)c_{o,2} = r(2) \tag{9.3.33}$$

with

$$P_o^f = r(0) - r(1)c_{o,1} - r(2)c_{o,2} \tag{9.3.34}$$

whose comparison with (9.3.26) through (9.3.28) shows that $c_{o,1} = -a_1$, $c_{o,2} = -a_2$, and $P_o^f = \sigma_w^2$, as expected.
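The eigenvalues of the input correlation matrix

$$\mathbf{R} = \begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix} \tag{9.3.35}$$

are

$$\lambda_{1,2} = \left(1 \mp \frac{a_1}{1 + a_2}\right)\sigma_x^2 \tag{9.3.36}$$

from which the eigenvalue spread is

$$\chi(\mathbf{R}) = \frac{\lambda_1}{\lambda_2} = \frac{1 - a_1 + a_2}{1 + a_1 + a_2} \tag{9.3.37}$$

which, if $a_2 > 0$ and $a_1 < 0$, is larger than 1.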
Now we perform MATLAB experiments with varying eigenvalue spread $\chi(\mathbf{R})$ and step-size parameter $\mu$. In these
experiments, we choose $\sigma_w^2$ so that $\sigma_x^2 = 1$. The SDA is given by

$$\mathbf{c}_k \triangleq [c_{k,1}\;\; c_{k,2}]^T = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1})$$

where $\mathbf{d} = [r(1)\;\; r(2)]^T$ and $\mathbf{c}_0 = [0\;\; 0]^T$.

We choose two different sets of values for $a_1$ and $a_2$, one for a small and the other for a large eigenvalue spread. These values are
shown in Table 9.2 along with the corresponding eigenvalue spread $\chi(\mathbf{R})$ and the MMSE $\sigma_w^2$.


TABLE 9.2
Parameter values used in the SDA for the second-order forward prediction problem.

Eigenvalue spread   $a_1$      $a_2$   $\lambda_1$   $\lambda_2$   $\chi(\mathbf{R})$   $\sigma_w^2$
Small               -0.1950    0.95    1.1           0.9           1.22                 0.0965
Large               -1.5955    0.95    1.818         0.182         9.99                 0.0322
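The eigenvalue columns of Table 9.2 follow from the AR(2) parameters alone. The sketch below recomputes them under the assumption $\sigma_x^2 = 1$, using the fact that a 2-by-2 matrix with equal diagonal entries $r(0)$ and off-diagonal entries $r(1)$ has eigenvalues $r(0) \pm r(1)$; the function name is hypothetical.

```python
def predictor_eigen_summary(a1, a2):
    """Eigenvalues and spread of R = [[1, r1], [r1, 1]] for unit signal power."""
    r1 = -a1 / (1.0 + a2)          # normalized autocorrelation r(1)/r(0)
    lam_1 = 1.0 + r1               # equals 1 - a1/(1 + a2)
    lam_2 = 1.0 - r1               # equals 1 + a1/(1 + a2)
    return lam_1, lam_2, lam_1 / lam_2


for label, a1, a2 in (("Small", -0.1950, 0.95), ("Large", -1.5955, 0.95)):
    lam_1, lam_2, spread = predictor_eigen_summary(a1, a2)
    print(label, round(lam_1, 3), round(lam_2, 3), round(spread, 2))
```

The recomputed spread for the large case comes out very close to 10; the small difference from the tabulated value appears to be consistent with the table having been computed from rounded eigenvalues.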


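Using each set of parameter values, the SDA is implemented starting with the null coefficient vector $\mathbf{c}_0$ with two values of
step-size parameters. To describe the transient behavior of the algorithm, it is informative to plot the trajectory of $c_{k,1}$ versus $c_{k,2}$
as a function of the iteration index $k$ along with the contours of the error surface $P(\mathbf{c}_k)$. The trajectory of $\mathbf{c}_k$ begins at the origin $\mathbf{c}_0 = \mathbf{0}$
and ends at the optimum value $\mathbf{c}_o = -[a_1\;\; a_2]^T$. This illustration of the transient behavior can also be obtained in the domain
of the transformed error coefficients $\tilde{\mathbf{c}}'_k$. Using (9.3.15), we see these coefficients are given by

$$\tilde{\mathbf{c}}'_k = \begin{bmatrix} \tilde{c}'_{k,1} \\ \tilde{c}'_{k,2} \end{bmatrix} = \begin{bmatrix} (1 - 2\mu\lambda_1)^k\,\tilde{c}'_{0,1} \\ (1 - 2\mu\lambda_2)^k\,\tilde{c}'_{0,2} \end{bmatrix} \tag{9.3.38}$$

where $\tilde{\mathbf{c}}'_0$ from (9.3.10) and (9.3.13) is given by

$$\tilde{\mathbf{c}}'_0 = \begin{bmatrix} \tilde{c}'_{0,1} \\ \tilde{c}'_{0,2} \end{bmatrix} = \mathbf{Q}^T\tilde{\mathbf{c}}_0 = \mathbf{Q}^T(\mathbf{c}_0 - \mathbf{c}_o) = -\mathbf{Q}^T\mathbf{c}_o = \mathbf{Q}^T\begin{bmatrix} a_1 \\ a_2 \end{bmatrix} \tag{9.3.39}$$

Thus the trajectory of $\tilde{\mathbf{c}}'_k$ begins at $\tilde{\mathbf{c}}'_0$ and ends at the origin $\tilde{\mathbf{c}}'_k = \mathbf{0}$. The contours of the MSE function in the transformed
domain are given by $P_k - P_o$. From (9.3.20), these contours are given by

$$P_k - P_o = \sum_{i=1}^{2} \lambda_i\,(\tilde{c}'_{k,i})^2 = \lambda_1(\tilde{c}'_{k,1})^2 + \lambda_2(\tilde{c}'_{k,2})^2 \tag{9.3.40}$$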


Small eigenvalue spread and overdamped response. For this experiment, the parameter values were selected to obtain an
eigenvalue spread approximately equal to 1 [$\chi(\mathbf{R}) = 1.22$]. The step size selected was $\mu = 0.15$, which is less than
$1/\lambda_{\max} = 1/1.1 \approx 0.9$, as required for convergence. For this value of $\mu$, the transient response is overdamped. Figure 9.14 shows four
graphs indicating the behavior of the algorithm. In graph (a), the trajectory of $\tilde{\mathbf{c}}'_k$ is shown for $0 \le k \le 15$ along with the
corresponding loci of $\tilde{\mathbf{c}}'_k$ for a fixed value of $P_k - P_o$. The first two loci for $k = 0$ and 1 are numbered to show the direction
of the trajectory. Graph (b) shows the corresponding trajectory and the contours for $\mathbf{c}_k$. Graph (c) shows plots of $c_{k,1}$ and $c_{k,2}$
as a function of iteration step $k$, while graph (d) shows a similar learning curve for the MSE $P_k$. Several observations can be
made about these plots. The contours of constant $\tilde{\mathbf{c}}'_k$ are almost circular since the spread is approximately 1, while those of $\mathbf{c}_k$
are somewhat elliptical, which is to be expected. The trajectories of $\tilde{\mathbf{c}}'_k$ and $\mathbf{c}_k$ as a function of $k$ are normal to the contours.
The coefficients converge to their optimum values in a monotonic fashion, which confirms the overdamped nature of the response.
Also this convergence is rapid, in about 15 steps, which is to be expected for a small eigenvalue spread.

Large eigenvalue spread and overdamped response. For this experiment, the parameter values were selected so that the
eigenvalue spread was approximately equal to 10 [$\chi(\mathbf{R}) = 9.99$]. The step size was again selected as $\mu = 0.15$. Figure 9.15
shows the performance plots for this experiment, which are similar to those of Figure 9.14. The observations are also similar except
for those due to the larger spread. First, the contours, even in the transformed domain, are elliptical; second, the convergence is
slow, requiring about 60 steps in the algorithm. The transient response is once again overdamped.
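The effect of the eigenvalue spread on the speed of the SDA can be reproduced with a few lines of code. The sketch below is not the MATLAB experiment of the text: it runs the recursion for both parameter sets of Table 9.2 with $\mu = 0.15$ and counts the iterations needed for the excess MSE to fall below 1 percent of its initial value. That stopping rule and the function names are assumptions made here, so the counts are not directly comparable with the step counts quoted above, which describe the plotted curves.

```python
def predictor_moments(a1, a2):
    """R and d of the second-order predictor, assuming unit signal power r(0) = 1."""
    r1 = -a1 / (1.0 + a2)
    r2 = -a2 + a1 * a1 / (1.0 + a2)
    return [[1.0, r1], [r1, 1.0]], [r1, r2]


def iterations_to_converge(a1, a2, mu, tol=0.01, max_iter=10000):
    """Number of SDA steps until the excess MSE drops below tol times its start value."""
    R, d = predictor_moments(a1, a2)
    c_opt = [-a1, -a2]                     # optimum predictor coefficients

    def excess(c):
        e = [c[0] - c_opt[0], c[1] - c_opt[1]]
        return sum(e[i] * R[i][j] * e[j] for i in range(2) for j in range(2))

    c = [0.0, 0.0]                         # null initial vector
    start = excess(c)
    for k in range(1, max_iter + 1):
        Rc = [R[i][0] * c[0] + R[i][1] * c[1] for i in range(2)]
        c = [c[i] + 2.0 * mu * (d[i] - Rc[i]) for i in range(2)]
        if excess(c) < tol * start:
            return k
    return None                            # did not converge within max_iter


steps_small = iterations_to_converge(-0.1950, 0.95, mu=0.15)
steps_large = iterations_to_converge(-1.5955, 0.95, mu=0.15)
print(steps_small, steps_large)
```

With the same step size, the large-spread case needs several times more iterations, because the mode associated with the smallest eigenvalue decays slowly.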
FIGURE 9.14
Performance curves for the steepest-descent algorithm used in the linear prediction problem with step-size parameter $\mu = 0.15$ and
eigenvalue spread $\chi(\mathbf{R}) = 1.22$: (a) locus of $\tilde{c}'_{k,1}$ versus $\tilde{c}'_{k,2}$; (b) locus of $c_{k,1}$ versus $c_{k,2}$; (c) $\mathbf{c}_k$ learning curve; (d) MSE $P_k$ learning curve.

FIGURE 9.15
Performance curves for the steepest-descent algorithm used in the linear prediction problem with step-size parameter $\mu = 0.15$ and
eigenvalue spread $\chi(\mathbf{R}) = 10$.

FIGURE 9.16
Performance curves for the steepest-descent algorithm used in the linear prediction problem with eigenvalue spread $\chi(\mathbf{R}) = 10$
and varying step-size parameters $\mu = 0.15$ and $\mu = 0.5$.


Note that with the larger step size $\mu = 0.5$ in Figure 9.16, the coefficients converge in an oscillatory fashion; however, the convergence is fairly rapid compared to that of the
overdamped case. Thus the selection of the step size is an important design issue.
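The difference between the two step sizes of Figure 9.16 can be read off the per-iteration mode factors $1 - 2\mu\lambda_i$ from (9.3.15): a negative factor makes the corresponding mode alternate in sign (oscillatory response), and the largest magnitude sets the slowest decay. A small sketch using the rounded eigenvalues of the large-spread case in Table 9.2, with a hypothetical helper name:

```python
def mode_factor(mu, lam):
    """Per-iteration factor (1 - 2*mu*lam) of one natural mode of the SDA."""
    return 1.0 - 2.0 * mu * lam


eigenvalues = (1.818, 0.182)            # large-spread case of Table 9.2
summary = {}
for mu in (0.15, 0.5):
    factors = [mode_factor(mu, lam) for lam in eigenvalues]
    kind = "oscillatory" if any(f < 0.0 for f in factors) else "monotonic"
    slowest = max(abs(f) for f in factors)   # closest to 1 means slowest decay
    summary[mu] = (factors, kind, slowest)
    print(mu, factors, kind, slowest)
```

For $\mu = 0.5$ the factor of the first mode is negative but smaller than 1 in magnitude, so the mode oscillates while decaying, and the slowest factor is smaller than for $\mu = 0.15$, which is consistent with the faster convergence noted above.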


Newton’s type of algorithms 

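Another family of algorithms with a faster rate of convergence includes Newton's method and its modifications.
The basic idea of Newton's method is to achieve convergence in one step when $P(\mathbf{c})$ is quadratic. Thus, if $\mathbf{c}_k$ is to
be the minimum of $P(\mathbf{c})$, the gradient $\nabla P(\mathbf{c}_k)$ of $P(\mathbf{c})$ evaluated at $\mathbf{c}_k$ (9.2.19) should be zero. From (9.2.19),
we can write

$$\nabla P(\mathbf{c}_k) = \nabla P(\mathbf{c}_{k-1}) + \nabla^2 P(\mathbf{c}_{k-1})\,\Delta\mathbf{c}_k = \mathbf{0} \tag{9.3.41}$$

Thus $\nabla P(\mathbf{c}_k) = \mathbf{0}$ leads to the step increment

$$\Delta\mathbf{c}_k = -[\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{9.3.42}$$

and hence the adaptive algorithm is given by

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu[\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{9.3.43}$$

where $\mu > 0$ is the step size. For quadratic error surfaces, from (9.3.4) and (9.3.5), we obtain with $\mu = 1$

$$\mathbf{c}_k = \mathbf{c}_{k-1} - [\nabla^2 P(\mathbf{c}_{k-1})]^{-1}\,\nabla P(\mathbf{c}_{k-1}) = \mathbf{c}_{k-1} - (\mathbf{c}_{k-1} - \mathbf{R}^{-1}\mathbf{d}) = \mathbf{c}_o \tag{9.3.44}$$

which shows that indeed the algorithm converges in one step.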
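The one-step property in (9.3.44) is simple to verify numerically. The following sketch applies a single Newton step with $\mu = 1$ to a real-valued quadratic surface with a 2-by-2 correlation matrix; the function names, the starting point, and the second-order moments are hypothetical, and the 2-by-2 system is solved with Cramer's rule rather than by forming an explicit inverse.

```python
def solve_2x2(A, b):
    """Solve the 2-by-2 linear system A z = b with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton_step(c, R, d, mu=1.0):
    """One step c - mu * [Hessian]^{-1} * gradient for P(c) with real R and d."""
    grad = [2.0 * (R[i][0] * c[0] + R[i][1] * c[1] - d[i]) for i in range(2)]
    hessian = [[2.0 * R[i][j] for j in range(2)] for i in range(2)]
    delta = solve_2x2(hessian, grad)       # [Hessian]^{-1} times gradient
    return [c[i] - mu * delta[i] for i in range(2)]


# Hypothetical moments; the solution of R c = d is [0.5, 0.0].
R = [[1.0, 0.5], [0.5, 1.0]]
d = [0.5, 0.25]
c_next = newton_step([3.0, -2.0], R, d)    # arbitrary starting point
print(c_next)
```

With a step size smaller than 1 the same function moves only part of the way toward the minimum, so several steps are then needed.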
For the quadratic case, since $\nabla^2 P(\mathbf{c}_{k-1}) = 2\mathbf{R}$ from (9.3.1), we can express Newton's algorithm as

$$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu\mathbf{R}^{-1}\,\nabla P(\mathbf{c}_{k-1}) \tag{9.3.45}$$

where $\mu$ is the step size that regulates the convergence rate (the constant factor 1/2 from the Hessian is absorbed into $\mu$). Other modified Newton methods replace the Hessian
matrix $\nabla^2 P(\mathbf{c}_{k-1})$ with another matrix, which is guaranteed to be positive definite and, in some way, close to the
Hessian. These Newton-type algorithms generally provide faster convergence. However, in practice, the inversion of
$\mathbf{R}$ is numerically intensive and can lead to a numerically unstable solution if special care is not taken. Therefore, the
SDA is more popular in adaptive filtering applications.

When the function P(c) is nonquadratic, it is approximated locally by a quadratic function that is minimized 
exactly. However, the step obtained in (9.3.42) does not lead to the minimum of P(c), and the iteration should be 
repeated several times. A more detailed treatment of linear and nonlinear optimization techniques can be found in 
Scales (1985) and in Luenberger (1984). 


9.4 Least-Mean-Square Adaptive Filters 


In this section, we derive the least-mean-square (LMS) adaptive algorithm, analyze its performance, and present some
practical applications. The LMS algorithm, introduced by Widrow and Hoff (1960), is widely used in practice
due to its simplicity, computational efficiency, and good performance under a variety of operating conditions.


9.4.1 Derivation 


We first present two approaches to the derivation of the LMS algorithm that will help the reader to understand its
operation. The first approach uses an approximation to the gradient function, while the second approach uses geometric
arguments.


Optimization approach. The SDA uses the second-order moments $\mathbf{R}$ and $\mathbf{d}$ to iteratively compute the optimum
filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$, starting with an initial guess, usually $\mathbf{c}_0 = \mathbf{0}$, and then obtaining better approximations by taking
steps in the direction of the negative gradient, that is,

$$\mathbf{c}_k = \mathbf{c}_{k-1} + \mu[-\nabla P(\mathbf{c}_{k-1})] \tag{9.4.1}$$

where

$$\nabla P(\mathbf{c}_{k-1}) = 2(\mathbf{R}\mathbf{c}_{k-1} - \mathbf{d}) \tag{9.4.2}$$

is the gradient of the performance function (9.3.1). In practice, where only the input $\{\mathbf{x}(j)\}_0^n$ and the desired
response $\{y(j)\}_0^n$ are known, we can only compute an estimate of the "true" or exact gradient (9.4.2) using the
available data. To develop an adaptive algorithm from (9.4.1), we take the following steps: (1) replace the iteration
subscript $k$ by the time index $n$; and (2) replace $\mathbf{R}$ and $\mathbf{d}$ by their instantaneous estimates $\mathbf{x}(n)\mathbf{x}^H(n)$ and $\mathbf{x}(n)y^*(n)$,
respectively. The instantaneous estimate of the gradient (9.4.2) becomes

$$\hat{\nabla} P(\mathbf{c}_{k-1}) = 2\hat{\mathbf{R}}\mathbf{c}_{k-1} - 2\hat{\mathbf{d}} = 2\mathbf{x}(n)\mathbf{x}^H(n)\mathbf{c}(n-1) - 2\mathbf{x}(n)y^*(n) = -2\mathbf{x}(n)e^*(n) \tag{9.4.3}$$

where

$$e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n) \tag{9.4.4}$$

is the a priori filtering error. The estimate (9.4.3) also can be obtained by starting with the approximation
$P(\mathbf{c}) \simeq |e(n)|^2$ and taking its gradient. The coefficient adaptation algorithm is

$$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\mathbf{x}(n)e^*(n) \tag{9.4.5}$$

which is obtained by substituting (9.4.3) and (9.4.4) in (9.4.1). The step-size parameter $2\mu$ is also known as the
adaptation gain.

The LMS algorithm, specified by (9.4.5) and (9.4.4), has both important similarities to and important differences
from the SDA (9.3.7). The SDA contains deterministic quantities while the LMS operates on random quantities. The
SDA is not an adaptive algorithm because it only depends on the second-order moments $\mathbf{R}$ and $\mathbf{d}$ and not on the SOE
$\{\mathbf{x}(n,\zeta), y(n,\zeta)\}$. Also, the iteration index $k$ has nothing to do with time. Simply stated, the SDA provides an
iterative solution to the linear system $\mathbf{R}\mathbf{c} = \mathbf{d}$.


Geometric approach. Suppose that an adaptive filter operates in a stationary signal environment seeking the
optimum filter $\mathbf{c}_o$. At time $n$ the filter has access to input vector $\mathbf{x}(n)$, the desired response $y(n)$, and the
previous or old coefficient estimate $\mathbf{c}(n-1)$. Its goal is to use this information to determine a new estimate $\mathbf{c}(n)$
that is closer to the optimum vector $\mathbf{c}_o$ or equivalently to choose $\mathbf{c}(n)$ so that $\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$, where
$\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o$ is the coefficient error vector given by (9.2.24). Eventually, we want $\|\tilde{\mathbf{c}}(n)\|$ to become negligible
as $n \to \infty$.
The vector $\tilde{\mathbf{c}}(n-1)$ can be decomposed into two orthogonal components

$$\tilde{\mathbf{c}}(n-1) = \tilde{\mathbf{c}}_x(n-1) + \tilde{\mathbf{c}}_x^{\perp}(n-1) \tag{9.4.6}$$

one parallel and one orthogonal to the input vector $\mathbf{x}(n)$, as shown in Figure 9.17(a). The response of the error filter
$\tilde{\mathbf{c}}(n-1)$ to the input $\mathbf{x}(n)$ is

$$\tilde{y}(n) = \tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) = \tilde{\mathbf{c}}_x^H(n-1)\mathbf{x}(n) \tag{9.4.7}$$

which implies that

$$\tilde{\mathbf{c}}_x(n-1) = \frac{\tilde{y}^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{9.4.8}$$

which can be verified by direct substitution in (9.4.7). Note that $\mathbf{x}(n)/\|\mathbf{x}(n)\|$ is a unit vector along the direction of $\mathbf{x}(n)$.


FIGURE 9.17
The geometric approach for the derivation of the LMS algorithm.


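If we only know $\mathbf{x}(n)$ and $\tilde{y}(n)$, the best strategy to decrease $\tilde{\mathbf{c}}(n)$ is to choose $\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}_x^{\perp}(n-1)$, or
equivalently to subtract $\tilde{\mathbf{c}}_x(n-1)$ from $\tilde{\mathbf{c}}(n-1)$. From Figure 9.17(a) note that as long as $\tilde{\mathbf{c}}_x(n-1) \ne \mathbf{0}$,
$\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$. This suggests the following adaptation algorithm

$$\mathbf{c}(n) = \mathbf{c}(n-1) - \tilde{\mu}\,\frac{\tilde{y}^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{9.4.9}$$

which guarantees that $\|\tilde{\mathbf{c}}(n)\| < \|\tilde{\mathbf{c}}(n-1)\|$ as long as $0 < \tilde{\mu} < 2$ and $\tilde{y}(n) \ne 0$, as shown in Figure 9.17(b).
The best choice clearly is $\tilde{\mu} = 1$.

Unfortunately, the signal $\tilde{y}(n)$ is not available, and we have to replace it with some reasonable approximation.
From (9.2.18) and (9.2.10) we obtain

$$\begin{aligned}
\tilde{e}(n) \triangleq e(n) - e_o(n) &= y(n) - \hat{y}(n) - y(n) + \hat{y}_o(n) = \hat{y}_o(n) - \hat{y}(n) \\
&= [\mathbf{c}_o^H - \mathbf{c}^H(n-1)]\mathbf{x}(n) = -\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n) = -\tilde{y}(n)
\end{aligned} \tag{9.4.10}$$

where we have used (9.4.7). Using the approximation

$$\tilde{e}(n) = e(n) - e_o(n) \simeq e(n)$$

we combine it with (9.4.10) to get

$$\mathbf{c}(n) = \mathbf{c}(n-1) + \tilde{\mu}\,\frac{e^*(n)}{\|\mathbf{x}(n)\|^2}\,\mathbf{x}(n) \tag{9.4.11}$$

which is known as the normalized LMS algorithm. Note that the effective step size $\tilde{\mu}/\|\mathbf{x}(n)\|^2$ is time-varying. The
LMS algorithm in (9.4.5) follows if we set $\|\mathbf{x}(n)\| = 1$ and choose $\tilde{\mu} = 2\mu$.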


LMS algorithm. The LMS algorithm can be summarized as

$\hat{y}(n) = \mathbf{c}^H(n-1)\mathbf{x}(n)$    filtering
$e(n) = y(n) - \hat{y}(n)$    error formation    (9.4.12)
$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\,\mathbf{x}(n)e^*(n)$    coefficient updating

where $\mu$ is the adaptation step size. The algorithm requires $2M+1$ complex multiplications and $2M$ complex
additions. Figure 9.18 shows an implementation of an FIR adaptive filter using the LMS algorithm, which is
implemented in MATLAB using the function [yhat,c] = firlms(x,y,M,mu). The a posteriori form of the LMS
algorithm is developed in Problem 9.9.
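The three steps of (9.4.12) translate almost line by line into code. The sketch below is written in Python rather than MATLAB and is not the book's firlms function; the name `lms`, the zero prehistory of the input, and the test system `c_true` are assumptions made for illustration.

```python
import random

def lms(x, y, M, mu):
    """A priori LMS of (9.4.12) for an M-tap FIR filter.

    x, y: equal-length sequences (real or complex); returns (yhat, err, c).
    """
    c = [0j] * M          # c(-1) = 0
    buf = [0j] * M        # x(n) = [x(n), x(n-1), ..., x(n-M+1)], zero prehistory
    yhat, err = [], []
    for xn, yn in zip(x, y):
        buf = [xn] + buf[:-1]
        # filtering: yhat(n) = c^H(n-1) x(n)
        yh = sum(ck.conjugate() * xk for ck, xk in zip(c, buf))
        # error formation: e(n) = y(n) - yhat(n)
        e = yn - yh
        # coefficient updating: c(n) = c(n-1) + 2 mu x(n) e*(n)
        c = [ck + 2 * mu * xk * e.conjugate() for ck, xk in zip(c, buf)]
        yhat.append(yh)
        err.append(e)
    return yhat, err, c

# Noise-free identification of a hypothetical 3-tap system
rng = random.Random(0)
c_true = [0.5, -0.3, 0.1]
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [sum(c_true[k] * x[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(x))]
yhat, err, c = lms(x, y, M=3, mu=0.01)
print([round(ck.real, 4) for ck in c])
```

Because the desired response here contains no noise, the irreducible error is zero and the coefficients settle on `c_true`; with noise added they would fluctuate about it, as discussed in Section 9.4.2.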







FIGURE 9.18 
An FIR adaptive filter realization using the LMS algorithm. 


9.4.2 Adaptation in a Stationary SOE 


In the sequel, we study the stability and steady-state performance of the LMS algorithm in a stationary SOE; that is,
we assume that the input and the desired response processes are jointly stationary. In theory, the goal of the LMS
adaptive filter is to identify the optimum filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$ from observations of the input $\mathbf{x}(n)$ and the desired
response

$y(n) = \mathbf{c}_o^H\mathbf{x}(n) + e_o(n)$    (9.4.13)

The optimum error $e_o(n)$ is orthogonal to the vector $\mathbf{x}(n)$, that is, $E\{\mathbf{x}(n)e_o^*(n)\} = \mathbf{0}$, and acts as
measurement or output noise, as shown in Figure 9.19.







FIGURE 9.19 
LMS algorithm in a stationary SOE. 


The first step in the statistical analysis of the LMS algorithm is to determine a difference equation for the coefficient
error vector $\tilde{\mathbf{c}}(n)$. To this end, we subtract $\mathbf{c}_o$ from both sides of (9.4.5), to obtain

$\tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)e^*(n)$    (9.4.14)

which expresses the LMS algorithm in terms of the coefficient error vector. We next use (9.4.12) and (9.4.13) in
(9.4.14) to eliminate $e(n)$ by expressing it in terms of $\tilde{\mathbf{c}}(n-1)$ and $e_o(n)$. The result is

$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\,\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)e_o^*(n)$    (9.4.15)

which is a time-varying forced or nonhomogeneous stochastic difference equation. The irreducible error $e_o(n)$
accounts for measurement noise, modeling errors, unmodeled dynamics, quantization effects, and other disturbances.
The presence of $e_o(n)$ prevents convergence because it forces $\tilde{\mathbf{c}}(n)$ to fluctuate around zero. Therefore, the
important issue is the BIBO stability of the system (9.4.15). From (9.2.28), we see that $\|\tilde{\mathbf{c}}(n)\|$ is bounded in mean
square if we can show that $E\{\tilde{\mathbf{c}}(n)\} \to \mathbf{0}$ as $n \to \infty$ and $\mathrm{var}\{\tilde{c}_k(n)\}$ is bounded for all $n$. To this end, we
develop difference equations for the mean value $E\{\tilde{\mathbf{c}}(n)\}$ and the correlation matrix

$\boldsymbol{\Phi}(n) \triangleq E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\}$    (9.4.16)

of the coefficient error vector $\tilde{\mathbf{c}}(n)$. As we shall see, the MSD and the EMSE can be expressed in terms of matrices
$\boldsymbol{\Phi}(n)$ and $\mathbf{R}$. The time evolution of these quantities provides sufficient information to evaluate the stability and
steady-state performance of the LMS algorithm.


Convergence of the mean coefficient vector
If we take the expectation of (9.4.15), we have

$E\{\tilde{\mathbf{c}}(n)\} = E\{\tilde{\mathbf{c}}(n-1)\} - 2\mu E\{\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\}$    (9.4.17)

because $E\{\mathbf{x}(n)e_o^*(n)\} = \mathbf{0}$ owing to the orthogonality principle. The computation of the second term in (9.4.17)
requires the correlation between the input signal and the coefficient error vector.
If we assume that $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ are statistically independent, (9.4.17) simplifies to

$E\{\tilde{\mathbf{c}}(n)\} = (\mathbf{I} - 2\mu\mathbf{R})E\{\tilde{\mathbf{c}}(n-1)\}$    (9.4.18)

which has the same form as (9.3.11) for the SDA. Therefore, $\tilde{\mathbf{c}}(n)$ converges in the mean, that is,
$\lim_{n\to\infty} E\{\tilde{\mathbf{c}}(n)\} = \mathbf{0}$, if the eigenvalues $1 - 2\mu\lambda_k$ of the system matrix $(\mathbf{I} - 2\mu\mathbf{R})$ are less than 1 in magnitude.
Hence, if $\mathbf{R}$ is positive definite and $\lambda_{\max}$ is its maximum eigenvalue, the condition

$0 < 2\mu < \frac{2}{\lambda_{\max}}$    (9.4.19)

ensures that the LMS algorithm converges in the mean [see the discussion following (9.2.27)].
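In the principal coordinates of $\mathbf{R}$, the recursion (9.4.18) decouples into scalar modes that evolve as $(1 - 2\mu\lambda_k)^n$. The sketch below, with hypothetical function names and eigenvalues, evaluates these modes and tests whether every mode decays.

```python
def mean_error_modes(lams, mu, n, v0=1.0):
    """k-th mode of E{c_tilde(n)} in principal coordinates: v0 * (1 - 2 mu lam_k)^n."""
    return [v0 * (1 - 2 * mu * lam) ** n for lam in lams]

def converges_in_mean(lams, mu):
    """True when every eigenvalue of (I - 2 mu R) is less than 1 in magnitude."""
    return all(abs(1 - 2 * mu * lam) < 1 for lam in lams)

lams = [0.1, 1.0, 2.0]      # hypothetical eigenvalues of R
for mu in (0.05, 0.4, 0.6):
    print(mu, converges_in_mean(lams, mu), mean_error_modes(lams, mu, n=50))
```

The mode attached to the smallest eigenvalue decays most slowly, which is why a large eigenvalue spread slows down convergence of the mean coefficient vector.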


Independence assumption. The independence assumption between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ was critical to the
derivation of (9.4.18). To simplify the analysis, we make the following independence assumptions (Gardner 1984):

A1 The sequence of input data vectors $\mathbf{x}(n)$ is independently and identically distributed with zero mean and
correlation matrix $\mathbf{R}$.
A2 The sequences $\mathbf{x}(n)$ and $e_o(n)$ are independent for all $n$.

From (9.4.15), we see that $\tilde{\mathbf{c}}(n-1)$ depends on $\tilde{\mathbf{c}}(0)$, $\{\mathbf{x}(k)\}_0^{n-1}$, and $\{e_o(k)\}_0^{n-1}$. Since the sequence $\mathbf{x}(n)$
is IID and the quantities $\mathbf{x}(n)$ and $e_o(n)$ are independent, we conclude that $\mathbf{x}(n)$, $e_o(n)$, and $\tilde{\mathbf{c}}(n-1)$ are
mutually independent. This result will be used several times to simplify the analysis of the LMS algorithm.

The independence assumption A1, first introduced in Widrow et al. (1976) and in Mazo (1979), ignores the
statistical dependence among successive input data vectors; however, it preserves sufficient statistical information
about the adaptation process to lead to useful design guidelines. Clearly, for FIR filtering applications, the
independence assumption is violated because two successive input data vectors $\mathbf{x}(n)$ and $\mathbf{x}(n+1)$ have $M-1$
common elements (shift-invariance property).
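Evolution of the coefficient error correlation matrix
The MSD can be expressed in terms of the trace of the correlation matrix $\boldsymbol{\Phi}(n)$, that is,
$\mathcal{D}(n) = \mathrm{tr}[\boldsymbol{\Phi}(n)]$    (9.4.20)
which can be easily seen by using (9.2.29) and the definition of trace. If we postmultiply both sides of (9.4.15) by
their respective Hermitian transposes and take the mathematical expectation, we obtain

$\boldsymbol{\Phi}(n) = E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\}$
$\quad = E\{[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]^H\}$
$\quad + 2\mu E\{[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]\tilde{\mathbf{c}}(n-1)e_o(n)\mathbf{x}^H(n)\}$    (9.4.21)
$\quad + 2\mu E\{\mathbf{x}(n)e_o^*(n)\tilde{\mathbf{c}}^H(n-1)[\mathbf{I} - 2\mu\mathbf{x}(n)\mathbf{x}^H(n)]^H\}$
$\quad + 4\mu^2 E\{\mathbf{x}(n)e_o^*(n)e_o(n)\mathbf{x}^H(n)\}$

From the independence assumptions, $e_o(n)$ is independent of $\tilde{\mathbf{c}}(n-1)$ and $\mathbf{x}(n)$. Therefore, the second and
third terms in (9.4.21) vanish, and the fourth term is equal to $4\mu^2 P_o\mathbf{R}$. If we expand the first term, we obtain

$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}] + 4\mu^2\mathbf{A} + 4\mu^2 P_o\mathbf{R}$    (9.4.22)

where

$\mathbf{A} \triangleq E\{\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\}$    (9.4.23)

and the terms $\mathbf{R}\boldsymbol{\Phi}(n-1)$ and $\boldsymbol{\Phi}(n-1)\mathbf{R}$ have been computed by using the mutual independence of
$\mathbf{x}(n)$, $\tilde{\mathbf{c}}(n-1)$, and $e_o(n)$.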

The computation of matrix $\mathbf{A}$ can be simplified if we make additional assumptions about the statistical properties
of $\mathbf{x}(n)$. As shown in Gardner (1984), development of a recursive relation for the elements of $\boldsymbol{\Phi}(n)$ using only the
independence assumptions requires products with, and the inversion of, an $M^2 \times M^2$ matrix, where $M$ is the size
of $\mathbf{x}(n)$.

The evaluation of this term when $x(n) \sim$ IID, an assumption that is more appropriate for data transmission
applications, is discussed in Gardner (1984). The computation for $x(n)$ being a spherically invariant random
process (SIRP) is discussed in Rupp (1993). SIRP models, which include the Gaussian distribution as a special case,
provide a good characterization of speech signals. However, independently of the assumption used, the basic
conclusions remain the same.

Assuming that $\mathbf{x}(n)$ is normally distributed, that is, $\mathbf{x}(n) \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, a significant amount of
simplification can be obtained. Indeed, in this case we can use the moment factorization property for normal random
variables to express fourth-order moments in terms of second-order moments (Papoulis 1991). If $z_1$, $z_2$, $z_3$, and $z_4$ are
complex-valued, zero-mean, and jointly distributed normal random variables, then

$E\{z_1 z_2^* z_3 z_4^*\} = E\{z_1 z_2^*\}E\{z_3 z_4^*\} + E\{z_1 z_4^*\}E\{z_2^* z_3\}$    (9.4.24)

or if they are real-valued, then

$E\{z_1 z_2 z_3 z_4\} = E\{z_1 z_2\}E\{z_3 z_4\} + E\{z_1 z_3\}E\{z_2 z_4\} + E\{z_1 z_4\}E\{z_2 z_3\}$    (9.4.25)

Using direct substitution of (9.4.24) or (9.4.25) in (9.4.23), we can show that

$\mathbf{A} = \begin{cases} \mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + \mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] & \text{complex case} \\ 2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + \mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] & \text{real case} \end{cases}$    (9.4.26)

Finally, substituting (9.4.26) in (9.4.22), we obtain a difference equation for $\boldsymbol{\Phi}(n)$. This is summarized in the
following property.

Note that when (9.4.19) holds, $\lim_{n\to\infty} E\{\tilde{\mathbf{c}}(n)\} = \mathbf{0}$, and therefore $\boldsymbol{\Phi}(n)$ asymptotically provides the covariance of $\tilde{\mathbf{c}}(n)$.
PROPERTY 9.4.1. Using the independence assumptions A1 and A2, and the normal distribution assumption of $\mathbf{x}(n)$, the
correlation matrix of the coefficient error vector $\tilde{\mathbf{c}}(n)$ satisfies the difference equation

$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}]$
$\quad + 4\mu^2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + 4\mu^2\mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] + 4\mu^2 P_o\mathbf{R}$    (9.4.27)

in the complex case and

$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}(n-1) - 2\mu[\mathbf{R}\boldsymbol{\Phi}(n-1) + \boldsymbol{\Phi}(n-1)\mathbf{R}]$
$\quad + 8\mu^2\mathbf{R}\boldsymbol{\Phi}(n-1)\mathbf{R} + 4\mu^2\mathbf{R}\,\mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] + 4\mu^2 P_o\mathbf{R}$    (9.4.28)

in the real case. Both relations are matrix difference equations driven by the constant term $4\mu^2 P_o\mathbf{R}$.

The presence of the term $4\mu^2 P_o\mathbf{R}$ in (9.4.27) or (9.4.28) implies that $\boldsymbol{\Phi}(n)$ will never become zero, and as a
result the coefficients of the LMS adaptive filter will always fluctuate about their optimum settings, which prevents
convergence. It has been shown (Bucklew et al. 1993) that asymptotically $\tilde{\mathbf{c}}(n)$ follows a zero-mean normal
distribution. The amount of fluctuation is measured by matrix $\boldsymbol{\Phi}(n)$. In contrast, the absence of a driving term in
(9.4.18) allows the convergence of $E\{\mathbf{c}(n)\}$ to the optimum vector $\mathbf{c}_o$.

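Since there are two distinct forms for the difference equation of $\boldsymbol{\Phi}(n)$, we will consider the real case (9.4.28)
for further discussion. Similar analysis can be done for the complex case (9.4.27), which is undertaken in Problem
9.11. To further simplify the analysis, we transform $\boldsymbol{\Phi}(n)$ to the principal coordinate space of $\mathbf{R}$ using the spectral
decomposition

$\mathbf{Q}^T\mathbf{R}\mathbf{Q} = \boldsymbol{\Lambda}$

by defining the matrix

$\boldsymbol{\Theta}(n) \triangleq \mathbf{Q}^T\boldsymbol{\Phi}(n)\mathbf{Q}$    (9.4.29)

which is symmetric and positive definite [when $\boldsymbol{\Phi}(n)$ is positive definite].
If we pre- and postmultiply (9.4.28) by $\mathbf{Q}^T$ and $\mathbf{Q}$ and use $\mathbf{Q}^T\mathbf{Q} = \mathbf{Q}\mathbf{Q}^T = \mathbf{I}$, we obtain

$\boldsymbol{\Theta}(n) = \boldsymbol{\Theta}(n-1) - 2\mu[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1) + \boldsymbol{\Theta}(n-1)\boldsymbol{\Lambda}]$
$\quad + 8\mu^2\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1)\boldsymbol{\Lambda} + 4\mu^2\boldsymbol{\Lambda}\,\mathrm{tr}[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n-1)] + 4\mu^2 P_o\boldsymbol{\Lambda}$    (9.4.30)

which is easier to work with because of the diagonal nature of $\boldsymbol{\Lambda}$. For any symmetric and positive definite matrix
$\boldsymbol{\Theta}$, we have $|\theta_{ij}(n)|^2 \le \theta_{ii}(n)\theta_{jj}(n)$. Hence, the convergence of the diagonal elements ensures the convergence of the
off-diagonal elements. This observation and (9.4.30) suggest that to analyze the LMS algorithm, we should extract
from (9.4.30) the equations for the diagonal elements

$\boldsymbol{\theta}(n) \triangleq [\theta_1(n)\ \theta_2(n)\ \cdots\ \theta_M(n)]^T$    (9.4.31)

of $\boldsymbol{\Theta}(n)$ and form a difference equation for the vector $\boldsymbol{\theta}(n)$. Indeed, we can easily show that

$\boldsymbol{\theta}(n) = \mathbf{B}\boldsymbol{\theta}(n-1) + 4\mu^2 P_o\boldsymbol{\lambda}$    (9.4.32)

where

$\mathbf{B} \triangleq \boldsymbol{\Lambda}(\rho) + 4\mu^2\boldsymbol{\lambda}\boldsymbol{\lambda}^T$    (9.4.33)
$\boldsymbol{\lambda} \triangleq [\lambda_1\ \lambda_2\ \cdots\ \lambda_M]^T$    (9.4.34)
$\boldsymbol{\Lambda}(\rho) \triangleq \mathrm{diag}\{\rho_1, \rho_2, \ldots, \rho_M\}$    (9.4.35)
$\rho_k = 1 - 4\mu\lambda_k + 8\mu^2\lambda_k^2 = (1 - 2\mu\lambda_k)^2 + 4\mu^2\lambda_k^2 > 0 \qquad 1 \le k \le M$    (9.4.36)

and $\lambda_k$ are the eigenvalues of $\mathbf{R}$. The solution of the vector difference equation (9.4.32) is

$\boldsymbol{\theta}(n) = \mathbf{B}^n\boldsymbol{\theta}(0) + 4\mu^2 P_o\sum_{j=0}^{n-1}\mathbf{B}^j\boldsymbol{\lambda}$    (9.4.37)

and can be easily found by recursion.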
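The recursion (9.4.32) is easy to iterate numerically. The sketch below builds $\mathbf{B}$ from (9.4.33) to (9.4.36) for the real case and propagates $\boldsymbol{\theta}(n)$; the function names, eigenvalues, step size, and $P_o$ are hypothetical values chosen for illustration.

```python
def build_B(lams, mu):
    """B = diag(rho) + 4 mu^2 lam lam^T for the real case, (9.4.33)-(9.4.36)."""
    M = len(lams)
    rho = [1 - 4 * mu * l + 8 * mu ** 2 * l ** 2 for l in lams]
    return [[(rho[i] if i == j else 0.0) + 4 * mu ** 2 * lams[i] * lams[j]
             for j in range(M)] for i in range(M)]

def theta_recursion(lams, mu, P_o, theta0, n_steps):
    """Iterate theta(n) = B theta(n-1) + 4 mu^2 P_o lam, (9.4.32); returns all iterates."""
    B = build_B(lams, mu)
    M = len(lams)
    theta = list(theta0)
    history = [theta]
    for _ in range(n_steps):
        theta = [sum(B[i][j] * theta[j] for j in range(M))
                 + 4 * mu ** 2 * P_o * lams[i] for i in range(M)]
        history.append(theta)
    return history

lams = [0.5, 1.0, 1.5]      # hypothetical eigenvalues of R
history = theta_recursion(lams, mu=0.02, P_o=0.1, theta0=[1.0, 1.0, 1.0], n_steps=3000)
print(history[-1])          # settles to a nonzero steady-state value
```

The iterates do not decay to zero: the driving term $4\mu^2 P_o\boldsymbol{\lambda}$ keeps $\boldsymbol{\theta}(n)$ at a positive steady-state level, which is the source of the excess MSE.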
The stability of the linear system (9.4.32) is determined by the eigenvalues of the symmetric matrix $\mathbf{B}$. Using
(9.4.33) and (9.4.35), for an arbitrary vector $\mathbf{z}$, we obtain

$\mathbf{z}^T\mathbf{B}\mathbf{z} = \mathbf{z}^T\boldsymbol{\Lambda}(\rho)\mathbf{z} + 4\mu^2(\boldsymbol{\lambda}^T\mathbf{z})^2 = \sum_{k=1}^{M}\rho_k z_k^2 + 4\mu^2(\boldsymbol{\lambda}^T\mathbf{z})^2$    (9.4.38)

where we have used (9.4.36). Hence (9.4.38), for $\mathbf{z} \neq \mathbf{0}$, implies that $\mathbf{z}^T\mathbf{B}\mathbf{z} > 0$, that is, the matrix $\mathbf{B}$ is positive
definite. Since matrix $\mathbf{B}$ is symmetric and positive definite, its eigenvalues $\lambda_k(\mathbf{B})$ are real and positive. The system
(9.4.37) will be BIBO stable if and only if

$0 < \lambda_k(\mathbf{B}) < 1 \qquad 1 \le k \le M$    (9.4.39)

To find the range of $\mu$ that ensures (9.4.39), we use the Gerschgorin circles theorem (Noble and Daniel 1988),
which states that each eigenvalue of an $M \times M$ matrix $\mathbf{B}$ lies in at least one of the disks with center at the diagonal
element $b_{kk}$ and radius equal to the sum of absolute values $|b_{kj}|$, $j \neq k$, of the remaining elements of the row.
Since the elements of $\mathbf{B}$ are positive, we can easily see that

$\lambda_k(\mathbf{B}) - b_{kk} < \sum_{j=1,\, j\neq k}^{M} b_{kj} \quad \text{or} \quad \lambda_k(\mathbf{B}) < \rho_k + 4\mu^2\lambda_k\sum_{j=1}^{M}\lambda_j$


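using (9.4.33). Hence using (9.4.36), we see the eigenvalues of $\mathbf{B}$ satisfy (9.4.39) if

$1 - 4\mu\lambda_k + 8\mu^2\lambda_k^2 + 4\mu^2\lambda_k\,\mathrm{tr}\,\mathbf{R} < 1$

or

$-\mu\lambda_k + 2\mu^2\lambda_k^2 + \mu^2\lambda_k\,\mathrm{tr}\,\mathbf{R} < 0$

which implies that $\mu > 0$ and

$2\mu < \frac{1}{\lambda_k + \mathrm{tr}\,\mathbf{R}} < \frac{1}{\mathrm{tr}\,\mathbf{R}}$

because $\lambda_k > 0$ for all $k$. In conclusion, if the adaptation step $\mu$ satisfies the condition

$0 < 2\mu < \frac{1}{\mathrm{tr}\,\mathbf{R}}$    (9.4.40)

then the system (9.4.37) is stable and therefore the sequence $\boldsymbol{\theta}(n)$ converges.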
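The Gerschgorin argument can be illustrated numerically by comparing the row sums $\rho_k + 4\mu^2\lambda_k\,\mathrm{tr}\,\mathbf{R}$ with the largest eigenvalue of $\mathbf{B}$. In the sketch below the eigenvalue is estimated by power iteration, a technique chosen here for convenience and not used in the text; all names and numerical values are hypothetical.

```python
def build_B(lams, mu):
    """B = diag(rho) + 4 mu^2 lam lam^T (real case)."""
    M = len(lams)
    rho = [1 - 4 * mu * l + 8 * mu ** 2 * l ** 2 for l in lams]
    return [[(rho[i] if i == j else 0.0) + 4 * mu ** 2 * lams[i] * lams[j]
             for j in range(M)] for i in range(M)]

def row_sum_bounds(lams, mu):
    """Gerschgorin upper bounds: rho_k + 4 mu^2 lam_k tr R, one per row of B."""
    tr_R = sum(lams)
    return [1 - 4 * mu * l + 8 * mu ** 2 * l ** 2 + 4 * mu ** 2 * l * tr_R
            for l in lams]

def largest_eigenvalue(B, iters=2000):
    """Power iteration; B has positive entries, so the iterates stay positive."""
    v = [1.0] * len(B)
    lam = 0.0
    for _ in range(iters):
        w = [sum(bij * vj for bij, vj in zip(row, v)) for row in B]
        lam = max(w)
        v = [a / lam for a in w]
    return lam

lams = [0.5, 1.0, 1.5]      # hypothetical eigenvalues of R, tr R = 3
mu = 0.02                   # 2*mu = 0.04 < 1/tr R
print(largest_eigenvalue(build_B(lams, mu)), max(row_sum_bounds(lams, mu)))
```

For this small step size both the largest eigenvalue and its row-sum bound are below 1, so (9.4.39) holds; for a large step size the row sums exceed 1 and the bound no longer guarantees stability.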


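PROPERTY 9.4.2. When the stability condition (9.4.40) holds, the solution (9.4.37) of the difference equation (9.4.32) can be
written as

$\boldsymbol{\theta}(n) = \mathbf{B}^n[\boldsymbol{\theta}(0) - \boldsymbol{\theta}(\infty)] + \boldsymbol{\theta}(\infty)$    (9.4.41)

where $\boldsymbol{\theta}(0)$ is the initial value and $\boldsymbol{\theta}(\infty)$ is the steady-state value of $\boldsymbol{\theta}(n)$.
Proof. Using the identity

$\sum_{j=0}^{n-1}\mathbf{B}^j = (\mathbf{I} - \mathbf{B}^n)(\mathbf{I} - \mathbf{B})^{-1} = (\mathbf{I} - \mathbf{B})^{-1} - \mathbf{B}^n(\mathbf{I} - \mathbf{B})^{-1}$

the solution (9.4.37) can be written as

$\boldsymbol{\theta}(n) = \mathbf{B}^n[\boldsymbol{\theta}(0) - 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}] + 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}$    (9.4.42)

When the eigenvalues of $\mathbf{B}$ are inside the unit circle, we have

$\lim_{n\to\infty}\boldsymbol{\theta}(n) = \boldsymbol{\theta}(\infty) = 4\mu^2 P_o(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}$    (9.4.43)

because the first term converges to zero. Substituting (9.4.43) in (9.4.42), we obtain (9.4.41).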
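Evolution of the mean square error
We next express the MSE as a function of $\boldsymbol{\lambda}$ and $\boldsymbol{\theta}$. Using (9.2.10) and (9.2.18), we have

$e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n) = e_o(n) - \tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)$    (9.4.44)

where $e_o(n)$ is the optimum filtering error and $\tilde{\mathbf{c}}(n)$ is the coefficient error vector. The (a priori) MSE of the
adaptive filter at time $n$ is

$P(n) \triangleq E\{|e(n)|^2\}$
$\quad = E\{|e_o(n)|^2\} - E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)e_o^*(n)\} - E\{e_o(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\}$    (9.4.45)
$\quad + E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\}$

Since $\tilde{\mathbf{c}}(n)$ is a random vector, the evaluation of the MSE (9.4.45) requires the correlation between $\mathbf{x}(n)$ and
$\tilde{\mathbf{c}}(n-1)$. Using the independence assumptions A1 and A2, we see that the second and third terms in (9.4.45) become
zero, as explained before, and the excess MSE is given by the last term

$P_{ex}(n) = E\{\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)\}$    (9.4.46)

If we define the quantities

$\mathbf{A} \triangleq \tilde{\mathbf{c}}^H(n-1)$ and $\mathbf{B} \triangleq \mathbf{x}(n)\mathbf{x}^H(n)\tilde{\mathbf{c}}(n-1)$    (9.4.47)

and notice that $\mathbf{A}\mathbf{B} = \mathrm{tr}(\mathbf{A}\mathbf{B})$ (because $\mathbf{A}\mathbf{B}$ is a scalar) and $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$, we obtain

$P_{ex}(n) = E\{\mathrm{tr}(\mathbf{A}\mathbf{B})\} = E\{\mathrm{tr}(\mathbf{B}\mathbf{A})\} = \mathrm{tr}(E\{\mathbf{B}\mathbf{A}\})$
$\quad = \mathrm{tr}(E\{\mathbf{x}(n)\mathbf{x}^H(n)\}E\{\tilde{\mathbf{c}}(n-1)\tilde{\mathbf{c}}^H(n-1)\})$

because expectation is a linear operation and $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ have been assumed statistically independent.
Therefore, the excess MSE can be expressed as

$P_{ex}(n) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)]$    (9.4.48)

where $\boldsymbol{\Phi}(n) = E\{\tilde{\mathbf{c}}(n)\tilde{\mathbf{c}}^H(n)\}$ is the correlation matrix of the coefficient error vector. This expression simplifies to

$P_{ex}(n) = M\sigma_x^2\sigma_c^2$    (9.4.49)

if $\mathbf{R} = \sigma_x^2\mathbf{I}$ and $\boldsymbol{\Phi}(n) = \sigma_c^2\mathbf{I}$.

If $\mathbf{R}$ and $\boldsymbol{\Phi}(n)$ are both positive definite, relation (9.4.48) shows that $P_{ex}(n) > 0$, that is, the MSE attained by
the adaptive filter is larger than the optimum MSE $P_o$ of the optimum filter (cost of adaptation).

Next we develop a difference equation for $P_{ex}(n)$, using, for convenience, the principal coordinate system of
the input correlation matrix $\mathbf{R}$. Since the trace of a matrix remains invariant under an orthogonal transformation, we
have

$P_{ex}(n) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n)] = \mathrm{tr}[\boldsymbol{\Lambda}\boldsymbol{\Theta}(n)] = \boldsymbol{\lambda}^T\boldsymbol{\theta}(n)$    (9.4.50)

where the elements of $\boldsymbol{\lambda}$ are the eigenvalues of $\mathbf{R}$ and the elements of $\boldsymbol{\theta}(n)$ are the diagonal elements of
$\boldsymbol{\Theta}(n)$.

Since the most often observable and important quantity for the operation of an adaptive filter is the MSE, we use
our previous results to determine the value of MSE as a function of $n$, that is, the learning curve of the LMS
adaptive filter. To this end, we use the orthogonal decomposition $\mathbf{B} = \mathbf{Q}(\mathbf{B})\boldsymbol{\Lambda}(\mathbf{B})\mathbf{Q}^T(\mathbf{B})$ to express $\mathbf{B}^n$ as

$\mathbf{B}^n = \mathbf{Q}(\mathbf{B})\boldsymbol{\Lambda}^n(\mathbf{B})\mathbf{Q}^T(\mathbf{B}) = \sum_{k=1}^{M}\lambda_k^n(\mathbf{B})\mathbf{q}_k(\mathbf{B})\mathbf{q}_k^T(\mathbf{B})$    (9.4.51)

where $\lambda_k(\mathbf{B})$ are the eigenvalues and $\mathbf{q}_k(\mathbf{B})$ are the eigenvectors of matrix $\mathbf{B}$. Substituting (9.4.41) and (9.4.51)
into (9.4.50) and recalling that $P(n) = P_o + P_{ex}(n)$, we obtain

$P(n) = P_o + P_{tr}(n) + P_{ex}(\infty)$    (9.4.52)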
where $P_{ex}(\infty)$ is termed the steady-state excess MSE and

$P_{tr}(n) \triangleq \sum_{k=1}^{M}\gamma_k(\mathbf{R}, \mathbf{B})\lambda_k^n(\mathbf{B})$    (9.4.53)

is termed the transient MSE because it dies out exponentially when $0 < \lambda_k(\mathbf{B}) < 1$, $1 \le k \le M$. The constants

$\gamma_k(\mathbf{R}, \mathbf{B}) \triangleq \boldsymbol{\lambda}^T(\mathbf{R})\mathbf{q}_k(\mathbf{B})\mathbf{q}_k^T(\mathbf{B})[\boldsymbol{\theta}(0) - \boldsymbol{\theta}(\infty)]$    (9.4.54)

are determined by the eigenvalues $\lambda_k(\mathbf{R})$ of matrix $\mathbf{R}$ and the eigenvectors $\mathbf{q}_k(\mathbf{B})$ of matrix $\mathbf{B}$. Since the
minimum MSE $P_o$ is available, we need to determine the steady-state excess MSE $P_{ex}(\infty)$.

PROPERTY 9.4.3. When the LMS adaptive algorithm converges, the steady-state excess MSE is given by

$P_{ex}(\infty) = P_o\,\frac{C(\mu)}{1 - C(\mu)}$    (9.4.55)

where

$C(\mu) \triangleq \sum_{k=1}^{M}\frac{\mu\lambda_k}{1 - 2\mu\lambda_k}$    (9.4.56)

and $\lambda_k$ are the eigenvalues of the input correlation matrix.
Proof. Using (9.4.32) and (9.4.35), we obtain the difference equation

$\theta_k(n) = \rho_k\theta_k(n-1) + 4\mu^2\lambda_k P_{ex}(n-1) + 4\mu^2 P_o\lambda_k$    (9.4.57)

When (9.4.40) holds, (9.4.57) attains the following steady-state form

$\theta_k(\infty) = \rho_k\theta_k(\infty) + 4\mu^2\lambda_k P_{ex}(\infty) + 4\mu^2 P_o\lambda_k$

whose solution, in conjunction with (9.4.36), gives

$\theta_k(\infty) = \mu\,\frac{P_o + P_{ex}(\infty)}{1 - 2\mu\lambda_k}$

and

$P_{ex}(\infty) = \sum_{k=1}^{M}\lambda_k\theta_k(\infty) = [P_o + P_{ex}(\infty)]\sum_{k=1}^{M}\frac{\mu\lambda_k}{1 - 2\mu\lambda_k}$

Solving the last equation for $P_{ex}(\infty)$, we obtain (9.4.55) and (9.4.56).

Solving (9.4.55) for $C(\mu)$ gives

$C(\mu) = \frac{P_{ex}(\infty)}{P_o + P_{ex}(\infty)}$    (9.4.58)

which implies that

$0 < C(\mu) < 1$    (9.4.59)

because $P_o$ and $P_{ex}(\infty)$ are positive quantities. It has been shown that (9.4.59) leads to the tighter bound
$0 < 2\mu < 2/(3\,\mathrm{tr}\,\mathbf{R})$ for the adaptation step $\mu$ (Horowitz and Senne 1981; Feuer and Weinstein 1985). Therefore,
convergence in the MSE imposes a stronger constraint on the step size $\mu$ than does (9.4.40), which ensures
convergence in the mean.
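Property 9.4.3 gives the steady-state excess MSE in closed form. The sketch below evaluates (9.4.55) and (9.4.56) for a given set of eigenvalues; the function names and the numerical values are hypothetical.

```python
def c_mu(lams, mu):
    """C(mu) of (9.4.56); requires 2*mu*lam_k < 1 for every eigenvalue."""
    if any(2 * mu * l >= 1 for l in lams):
        raise ValueError("step size too large: 2*mu*lam_k must be below 1")
    return sum(mu * l / (1 - 2 * mu * l) for l in lams)

def steady_state_excess_mse(lams, mu, P_o):
    """P_ex(inf) = P_o C(mu) / (1 - C(mu)), (9.4.55); valid only when 0 < C(mu) < 1."""
    C = c_mu(lams, mu)
    if not 0 < C < 1:
        raise ValueError("C(mu) outside (0, 1): condition (9.4.59) is violated")
    return P_o * C / (1 - C)

lams = [0.5, 1.0, 1.5]      # hypothetical eigenvalues of R
P_o = 0.1                   # hypothetical optimum MSE
for mu in (0.005, 0.02, 0.08):
    print(mu, steady_state_excess_mse(lams, mu, P_o))
```

The excess MSE grows with the step size, and it grows faster than linearly as $C(\mu)$ approaches 1.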


9.4.3 Summary and Design Guidelines 


There are many theoretical and simulation analyses of the LMS adaptive algorithm under a variety of assumptions. In 
this book, we have focused on results that help us to understand its operation and performance and to develop design 
guidelines for its practical application. The operation and performance of the LMS adaptive filter are determined by 
its stability and the properties of its learning curve, which shows the evolution of the MSE as a function of time. The 
MSE produced by the LMS adaptive algorithm consists of three components [see (9.4.52)]

$P(n) = P_o + P_{tr}(n) + P_{ex}(\infty)$

where $P_o$ is the optimum MSE, $P_{tr}(n)$ is the transient MSE, and $P_{ex}(\infty)$ is the steady-state excess MSE. This
equation provides the basis for understanding and evaluating the operation of the LMS adaptive algorithm in a
stationary SOE. For convenience, the LMS adaptive filtering algorithm is summarized in Table 9.3.

TABLE 9.3
Summary of the LMS algorithm.

Design parameters
    $\mathbf{x}(n)$ = input data vector at time $n$
    $y(n)$ = desired response at time $n$
    $\mathbf{c}(n)$ = filter coefficient vector at time $n$
    $M$ = number of coefficients
    $\mu$ = step-size parameter, $0 < \mu \ll 1 \big/ \sum_{k=1}^{M} E\{|x_k(n)|^2\}$
Initialization
    $\mathbf{c}(-1) = \mathbf{x}(-1) = \mathbf{0}$
Computation
    For $n = 0, 1, 2, \ldots$, compute
    $\hat{y}(n) = \mathbf{c}^H(n-1)\mathbf{x}(n)$
    $e(n) = y(n) - \hat{y}(n)$
    $\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\,\mathbf{x}(n)e^*(n)$

Stability. The LMS adaptive filter converges in the mean-square sense, that is, the transient MSE dies out, if the
adaptation step $\mu$ satisfies the condition

$0 < 2\mu < \frac{K}{\mathrm{tr}\,\mathbf{R}}$    (9.4.60)

where $\mathrm{tr}\,\mathbf{R}$ is the trace of the input correlation matrix and $K$ is a constant that depends weakly on
the statistics of the input data vector. For example, when $\mathbf{x}(n) \sim \mathcal{N}(\mathbf{0}, \mathbf{R})$, we obtained $K = 1$ from (9.4.40)
and $K = 2/3$ from (9.4.59).

In addition, this condition ensures that on average the LMS adaptive filter converges to the optimum filter. We stress
that in most practical applications, where the independence assumption does not hold, the step size $\mu$ should be
much smaller than $K/\mathrm{tr}\,\mathbf{R}$. Therefore, the exact value of $K$ is not important in practice.


Rate of convergence. The transient MSE dies out exponentially without exhibiting any oscillations. This follows
from (9.4.53) because when $\mu$ satisfies (9.4.40), the eigenvalues of matrix $\mathbf{B}$ are positive and less than 1. The
settling time, that is, the time taken for the transients to die out, is proportional to the average time constant

$\tau_{lms,av} = \frac{1}{\mu\lambda_{av}}$


where $\lambda_{av} = (\sum_{k=1}^{M}\lambda_k)/M$ is the average eigenvalue of $\mathbf{R}$ (Widrow et al. 1976). The quantity

$P_{tr}^{(total)} \triangleq \sum_{n=0}^{\infty}P_{tr}(n)$    (9.4.61)

which provides the total transient MSE, can be used as a measure for the speed of adaptation. When $\mu\lambda_k \ll 1$ (see
Problem 9.12), we have

$P_{tr}^{(total)} \simeq \frac{1}{4\mu}\sum_{k=1}^{M}\Delta\theta_k(0)$    (9.4.62)

where $\Delta\theta_k(0)$ is the initial distance of a coefficient from its optimum setting measured in principal coordinates. As
is intuitively expected, the smaller the step size and the farther the initial coefficients are from their optimum settings,
the more iterations it takes for the LMS algorithm to converge. Furthermore, from the discussion in Section 9.3, it
follows that the LMS algorithm will converge faster if the contours of the error surface are circles, that is, when the
input correlation matrix is $\mathbf{R} = \sigma_x^2\mathbf{I}$.


Steady-state excess MSE. The excess MSE after the adaptation has been completed (i.e., the steady-state value)
is given by (9.4.55). When $\mu\lambda_k \ll 1$, we may approximate (9.4.55) as follows

$P_{ex}(\infty) \simeq P_o\,\frac{\mu\,\mathrm{tr}\,\mathbf{R}}{1 - \mu\,\mathrm{tr}\,\mathbf{R}}$

which allows a much easier interpretation. Solving for $\mu\,\mathrm{tr}\,\mathbf{R}$, we obtain $\mu\,\mathrm{tr}\,\mathbf{R} \simeq P_{ex}(\infty)/[P_{ex}(\infty) + P_o]$, which
implies that $0 < \mu\,\mathrm{tr}\,\mathbf{R} < 1$. Since $\mu\,\mathrm{tr}\,\mathbf{R} \ll 1$, we often use the approximation

$P_{ex}(\infty) \simeq \mu P_o\,\mathrm{tr}\,\mathbf{R}$    (9.4.63)

which implies that $P_{ex}(\infty) \ll P_o$, that is, for small values of the step size the excess MSE is much smaller than the
optimum MSE. Note that the presence of the irreducible error $e_o(n)$ prevents perfect adaptation as $n \to \infty$
because $P_o > 0$.


Speed versus quality of adaptation. From the previous discussion we see that there is a tradeoff between rate of
convergence (speed of adaptation) and steady-state excess MSE (quality of adaptation, or accuracy of the adaptive
filter). The first requirement for an adaptive filter is stability, which is ensured by choosing $\mu$ to satisfy (9.4.60).
Within this range, decreasing $\mu$ to reduce the misadjustment to the desired level, according to (9.4.63), decreases the
speed of convergence; see (9.4.62). Conversely, increasing $\mu$ to increase the speed of convergence results
in an increase in misadjustment. This tradeoff between speed of convergence and misadjustment is a fundamental
feature of the LMS algorithm.


FIR filters. In this case, the input is a stationary process $x(n)$ with a Toeplitz correlation matrix $\mathbf{R}$. Therefore,
we have

$\mathrm{tr}\,\mathbf{R} = Mr(0) = ME\{|x(n)|^2\} = MP_x$    (9.4.64)

where $MP_x$ is called the tap input power. Substituting (9.4.64) into (9.4.40), we obtain

$0 < 2\mu < \frac{1}{MP_x} = \frac{1}{\text{tap input power}}$    (9.4.65)

which shows that the selection of the step size depends on the input power. Using (9.4.63) and (9.4.64), we see that
misadjustment $\mathcal{M}$ is given by

$\mathcal{M} \triangleq \frac{P_{ex}(\infty)}{P_o} \simeq \mu MP_x$    (9.4.66)

which shows that for given $M$ and $P_x$ the value of misadjustment is proportional to $\mu$. We emphasize that the
misadjustment provides a measure of how close an LMS adaptive filter is to the corresponding optimum filter.
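Relations (9.4.65) and (9.4.66) suggest a simple design procedure: pick the misadjustment one is willing to tolerate, solve (9.4.66) for $\mu$, and confirm that the result respects (9.4.65). The sketch below follows that procedure; the function name and the numerical values are hypothetical, and in practice a smaller step size than the bound allows is advisable.

```python
def lms_step_size_for_fir(M, P_x, misadjustment):
    """Step size from (9.4.66), mu = misadjustment / (M P_x), checked against (9.4.65)."""
    tap_input_power = M * P_x
    mu = misadjustment / tap_input_power
    mu_max = 0.5 / tap_input_power      # (9.4.65): 2*mu < 1 / (M P_x)
    if not 0 < mu < mu_max:
        raise ValueError("requested misadjustment violates the bound (9.4.65)")
    return mu

# Hypothetical design: 16 taps, input power 2.0, 10 percent misadjustment
mu = lms_step_size_for_fir(M=16, P_x=2.0, misadjustment=0.1)
print(mu)
```

Because (9.4.66) relies on the small-step approximation (9.4.63), the computed value is meaningful only when the requested misadjustment is well below 1/2.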
The statistical properties of the SOE, that is, the correlation of the input signal and the cross-correlation between
input and desired response signals, play a key role in the performance of the LMS adaptive filter.

• First, we should make sure that the relation between $x(n)$ and $y(n)$ can be accurately modeled by a linear FIR
filter with $M$ coefficients. Inadequacy of the FIR structure, output observation noise, or lack of
correlation between $x(n)$ and $y(n)$ increases the magnitude of the irreducible error. If $M$ is very large, we may
want to use a pole-zero IIR filter (Shynk 1989; Treichler et al. 1987). If the relationship between $x(n)$ and $y(n)$
is nonlinear, we certainly need a nonlinear filtering structure (Mathews 1991).

• The LMS algorithm uses a “noisy” instantaneous estimate of the gradient vector. However, when the correlation
between input and desired response is weak, the algorithm should make more cautious steps (“wait and average”).
Such algorithms update their coefficients every $L$ samples, using all samples between successive updatings to
determine the gradient (gradient averaging).

• The eigenvalue structure of $\mathbf{R}$ as measured by its eigenvalue spread ($\lambda_{\max}/\lambda_{\min}$) or, equivalently, by the spectral
flatness measure (SFM) (see Section 3.1) has a strong effect on the rate of convergence of the LMS algorithm. In
general, the rate of convergence decreases as the eigenvalue spread increases, that is, as the contours of the cost
function become more elliptical, or equivalently the input spectrum becomes more nonwhite.

Normalized LMS algorithm. According to (9.4.60), the selection of $\mu$ in practical applications is complicated
because the power of the input signal either is unknown or varies with time. This problem can be addressed by using
the normalized LMS (NLMS) algorithm [see (9.4.11)]

$\mathbf{c}(n) = \mathbf{c}(n-1) + \frac{\tilde{\mu}}{E_M(n)}\,\mathbf{x}(n)e^*(n)$    (9.4.67)

where $E_M(n) = \|\mathbf{x}(n)\|^2$ and $0 < \tilde{\mu} < 1$. It can be shown that the NLMS algorithm converges in the mean
square if $0 < \tilde{\mu} < 1$ (Rupp 1993; Slock 1993), which makes the selection of the step size $\tilde{\mu}$ much easier than the
selection of $\mu$ in the LMS algorithm.
For FIR filters, the quantity $E_M(n)$ provides an estimate of $ME\{|x(n)|^2\}$ and can be computed recursively
by using the sliding-window formula

$E_M(n) = E_M(n-1) + |x(n)|^2 - |x(n-M)|^2$    (9.4.68)

where $E_M(-1) = 0$, or a first-order recursive filter estimator. In practice, to avoid division by zero if $\mathbf{x}(n) = \mathbf{0}$, we
set $E_M(n) = \delta + \|\mathbf{x}(n)\|^2$, where $\delta$ is a small positive constant.
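The NLMS update (9.4.67) with the sliding-window energy (9.4.68) can be sketched as follows for real-valued signals. The function name `nlms`, the regularization value, and the test system are assumptions for illustration; the input is deliberately scaled to a large power to show that the normalized step size needs no retuning.

```python
import random

def nlms(x, y, M, mu_t, delta=1e-6):
    """Real-valued NLMS, (9.4.67), with E_M(n) updated by the sliding window (9.4.68)."""
    c = [0.0] * M
    buf = [0.0] * M          # before the shift: [x(n-1), ..., x(n-M)]
    energy = 0.0             # E_M(-1) = 0
    err = []
    for xn, yn in zip(x, y):
        # E_M(n) = E_M(n-1) + |x(n)|^2 - |x(n-M)|^2
        energy = max(energy + xn * xn - buf[-1] * buf[-1], 0.0)
        buf = [xn] + buf[:-1]
        e = yn - sum(ck * xk for ck, xk in zip(c, buf))
        gain = mu_t / (delta + energy)     # delta guards against division by zero
        c = [ck + gain * xk * e for ck, xk in zip(c, buf)]
        err.append(e)
    return c, err, energy

rng = random.Random(1)
c_true = [0.5, -0.3, 0.1]                   # hypothetical system to identify
x = [50.0 * rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [sum(c_true[k] * x[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(x))]
c, err, energy = nlms(x, y, M=3, mu_t=0.5)
print([round(ck, 4) for ck in c])
```

The clamp `max(..., 0.0)` is a practical safeguard, not part of (9.4.68): accumulated rounding errors could otherwise push the running energy slightly below zero.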


Other approaches and analyses. The analysis of the LMS algorithm presented in this section is simple, clarifies 
its performance, and provides useful design guidelines. However, there are many other approaches, which are beyond 
the scope of this book, that differ in terms of complexity, accuracy, and objectives. Major efforts to remove the 
independence assumption and replace it with the more realistic statistically dependent input assumption are 
documented in Macchi (1995), Solo (1997), and Butterweck (1995) and the references therein. Convergence analysis 
of the LMS algorithm using the stochastic approximation approach and a deterministic approach using the method of 
ordinary differential equations are discussed in Solo and Kong (1995), Sethares (1993), and Benveniste et al. (1987). 
Other types of analyses deal with the determination of the probability densities and the probability of large excursions 
of the adaptive filter coefficients for various types of input signals (Rupp 1995). The analysis of the convergence 
properties of the LMS algorithm and its variations is still an active area of research, and new results appear 
continuously. 


9.4.4 Applications of the LMS Algorithm 


We now discuss three practical applications in which the LMS algorithm has made a significant impact. In the 
first case, we consider the previously discussed linear prediction problem and compare the performance of the LMS 
algorithm with that of the SDA. Table 9.4 provides a summary of the key differences between the SDA and the LMS 
algorithms. In the second case, we study echo cancelation in full-duplex data transmission, which employs the LMS 
algorithm in its implementation. In the third case, we discuss the application of adaptive equalization, which is used 
to minimize intersymbol interference (ISI) in a dispersive channel environment. 





TABLE 9.4
Comparison between the SDA and LMS algorithms.

SDA                                                   LMS
Deterministic algorithm:                              Stochastic algorithm:
$\lim_{n\to\infty}\mathbf{c}(n) = \mathbf{c}_o$       $\lim_{n\to\infty}E\{\mathbf{c}(n)\} = \mathbf{c}_o$
If converges, it terminates to $\mathbf{c}_o$         If converges, it fluctuates about $\mathbf{c}_o$
                                                      The size of fluctuations is proportional to $\mu$
Noiseless gradient estimate                           Noisy gradient estimate
Deterministic steps                                   Random steps

We can only compare the ensemble average behavior of LMS with the SDA.





Linear prediction 

In Example 9.3.1, the AR(2) model given in (9.3.25) was considered, and the SDA was used to determine the 
corresponding linear predictor coefficients. We also analyzed the performance of the SDA. In the following example, 
we perform a similar acquisition of predictor coefficients using the LMS algorithm, and we study the effects of the 
eigenvalue spread of the input correlation matrix on the convergence of the LMS adaptive algorithm when it is used 
to update the coefficients. 
EXAMPLE 9.4.1. The second-order system in (9.3.25) is repeated here, which generates the signal $x(n)$:
$x(n) + a_1x(n-1) + a_2x(n-2) = w(n)$
where $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$ and where the coefficients are selected from Table 9.2 for two different eigenvalue spreads. A
Gaussian pseudorandom number generator was used to obtain 1000 realizations of $x(n)$ using each set of parameter values given
in Table 9.2. These sample realizations were used for statistical analysis.
The second-order LMS adaptive predictor with coefficients $\mathbf{c}(n) = [c_1(n)\ c_2(n)]^T$ is given by [see (9.4.12)]
$e(n) = x(n) - c_1(n-1)x(n-1) - c_2(n-1)x(n-2) \qquad n \ge 0$
$c_1(n) = c_1(n-1) + 2\mu e(n)x(n-1)$
$c_2(n) = c_2(n-1) + 2\mu e(n)x(n-2)$

where $\mu$ is the step-size parameter. The adaptive predictor was initialized by setting $x(-1) = x(-2) = 0$ and
$c_1(-1) = c_2(-1) = 0$. The above adaptive predictor was implemented with $\mu = 0.04$, and the predictor coefficients as well as
the MSE were recorded for each realization. These quantities were averaged to study the behavior of the LMS algorithm. These
calculations were repeated for $\mu = 0.01$.

In Figure 9.20 we show several plots obtained for $\mathcal{X}(\mathbf{R}) = 1.22$. In plot (a) we show the ensemble averaged trajectory
$\{\mathbf{c}(n)\}_0^{150}$ superimposed on the MSE contours. A trajectory of a single realization is also shown to illustrate its randomness. In
plot (b) the $\mathbf{c}(n)$ learning curve for the averaged value as well as for one single realization is shown. In plot (c) the corresponding
learning curves for the MSE are depicted. Finally, in plot (d) we show the effect of step size $\mu$ on the MSE learning curve.
Similar plots are shown in Figure 9.21 for $\mathcal{X}(\mathbf{R}) = 10$.


Several observations can be made from these plots:

• The trajectories and the learning curves for a single realization are clearly random or “noisy,” while the averaging over the
ensemble clearly has a smoothing effect.

• The averaged quantities (coefficients and the MSE) converge to the true values, and this convergence rate is in accordance
with theory.

• The rate of convergence of the LMS algorithm depends on the step size $\mu$. The smaller the step size, the slower the rate.

• The rate of convergence also depends on the eigenvalue spread $\mathcal{X}(\mathbf{R})$. The larger the spread, the slower the rate. For
$\mathcal{X}(\mathbf{R}) = 1.22$ the algorithm converges in about 150 steps while for $\mathcal{X}(\mathbf{R}) = 10$ it requires about 500 steps.

Clearly these observations compare well with the theory.
FIGURE 9.20
Performance curves for the LMS used in the linear prediction problem with step-size parameter $\mu = 0.04$ and eigenvalue spread
$\mathcal{X}(\mathbf{R}) = 1.22$.

FIGURE 9.21
Performance curves for the LMS used in the linear prediction problem with step-size parameter $\mu = 0.04$ and eigenvalue spread
$\mathcal{X}(\mathbf{R}) = 10$.


Echo cancelation in full-duplex data transmission 

Figure 9.22 illustrates a system that achieves simultaneous data transmission in both directions (full-duplex) 
over two-wire circuits using the special two-wire to four-wire interfaces (called hybrid couplers) that exist in any 
telephone set. Although the hybrid couplers are designed to provide perfect isolation between transmitters and 
receivers, this is not the case in practical systems. As a result, (1) one part of the transmitted signal leaks through the 
near-end hybrid to its own receiver (near-end echo), and (2) another part is reflected by the far-end hybrid and ends 
up at its own receiver (far-end echo). The combined echo signal, which can be 30 dB stronger than the signal received 
from the other end, increases the number of errors. We note that, in contrast with acoustic echo cancelation, the delay
of echoes in data transmission is immaterial.

FIGURE 9.22
Model of a full-duplex data transmission system that uses an echo canceler in the modems.


The best way to address this problem is to form a replica of the echo and then subtract it from the incoming 
signal. We can model the echoes as the result of an “echo” path between the transmitter and the receiver. For 
baseband data transmission this echo path is basically linear and varies very slowly with time. Therefore, we can 
obtain a replica of the echo signal using an FIR LMS adaptive filter (echo canceler), as shown in Figure 9.22. The
inclusion of the transmitter in the echo path, as long as it involves linear operations, simplifies the implementation
and improves the speed of adaptation because the input is an IID binary data sequence of values +1 and −1 with
equal probability (Verhoeckx et al. 1979).

Referring to Figure 9.23, if we assume that the echo path has an FIR impulse response, the echo signal is given
by

$$ y(n) = \mathbf{c}_o^T \mathbf{x}(n) \tag{9.4.69} $$

where $\mathbf{c}_o = [c_o(0) \;\; c_o(1) \;\; \cdots \;\; c_o(M-1)]^T$.

FIGURE 9.23
Block diagram of a system for investigating the performance of an adaptive echo canceler.


If $g(n)$ is the impulse response of the transmission path from the far-end transmitter to the near-end receiver, the
received signal is given by

$$ s_r(n) = y(n) + z(n) + v(n) = y(n) + u(n) \tag{9.4.70} $$

with

$$ z(n) = \sum_{k=0}^{\infty} g(k)\, s(n-k) $$

where $s(n)$ is the transmitted data signal and $v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$ is additive noise. The signal $u(n) = z(n) + v(n)$
represents the "uncancelable" signal because it cannot be removed by the canceler.
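The LMS adaptive echo canceler is given by

$$ \hat{y}(n) = \mathbf{c}^T(n-1)\,\mathbf{x}(n) \tag{9.4.71} $$
$$ e(n) = s_r(n) - \hat{y}(n) \tag{9.4.72} $$
$$ \mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\, e(n)\,\mathbf{x}(n) \tag{9.4.73} $$

where $\mu$ is the adaptation step size. The error is formed from the received signal $s_r(n)$ of (9.4.70), which is the only
signal available at the receiver. The adaptive filter takes advantage of the fact that $\mathbf{x}(n)$ is correlated with
$y(n)$ but uncorrelated with $s(n)$ and $v(n)$.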
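The residual (uncanceled) echo is

$$ e_r(n) = y(n) - \hat{y}(n) = [\mathbf{c}_o - \mathbf{c}(n-1)]^T \mathbf{x}(n) \triangleq -\tilde{\mathbf{c}}^T(n-1)\,\mathbf{x}(n) \tag{9.4.74} $$

and if we assume that $\tilde{\mathbf{c}}(n-1)$ and $\mathbf{x}(n)$ are independent, then

$$ P_r(n) = E\{e_r^2(n)\} = E\{\tilde{\mathbf{c}}^T(n-1)\,\tilde{\mathbf{c}}(n-1)\} $$

because $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\} = \mathbf{I}$. Using (9.4.69) to (9.4.73), we can easily show that

$$ \tilde{\mathbf{c}}(n) = \tilde{\mathbf{c}}(n-1) - 2\mu\,\mathbf{x}(n)\mathbf{x}^T(n)\,\tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)\,u(n) \tag{9.4.75} $$

If we premultiply (9.4.75) by its transpose and take the mathematical expectation, we obtain

$$ P_r(n+1) = (1 - 4\mu + 4\mu^2 M)\,P_r(n) + 4\mu^2 M \sigma_u^2 \tag{9.4.76} $$

using the independence assumption and the relation $\mathbf{x}^T(n)\mathbf{x}(n) = M$. The solution of (9.4.76), in terms of the
residual echo ratio $P_r(n)/\sigma_u^2$, is

$$ \frac{P_r(n)}{\sigma_u^2} = (1 - 4\mu + 4\mu^2 M)^n \left[ \frac{P_r(0)}{\sigma_u^2} - \frac{\mu M}{1 - \mu M} \right] + \frac{\mu M}{1 - \mu M} \tag{9.4.77} $$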

and describes completely the operation of the LMS adaptive echo canceler. Indeed, we draw the following 
conclusions: 
1. The algorithm converges if

$$ |1 - 4\mu + 4\mu^2 M| < 1 \quad \Leftrightarrow \quad 0 < \mu < \frac{1}{M} \tag{9.4.78} $$

which agrees with (9.4.40) because $\mathrm{tr}\,\mathbf{R} = M$.
2. After convergence we have

$$ P_r(\infty) = \frac{\mu M}{1 - \mu M}\,\sigma_u^2 \approx \mu M \sigma_u^2 \tag{9.4.79} $$

which again is in agreement with (9.4.63).
3. If $P_r(n)/\sigma_u^2 \gg \mu M/(1 - \mu M)$, we have

$$ \frac{P_r(n)}{P_r(0)} \approx (1 - 4\mu + 4\mu^2 M)^n \tag{9.4.80} $$

which can be used to find out how many iterations are required for a given echo reduction. For example, we can
easily show that achieving a 20-dB echo reduction requires $n \approx 1.15/\mu$ iterations.
From the previous discussion, it should be clear that the step size $\mu$ plays a crucial role in the performance of the
adaptive echo canceler because it determines both the rate of convergence and the minimum residual echo
that can be attained. Furthermore, we clearly see the tradeoff between fast adaptation and residual echo power.
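The three conclusions above can be checked numerically. The following Python sketch evaluates the contraction factor in (9.4.78), the steady-state ratio in (9.4.79), the closed-form solution (9.4.77), and the iteration count implied by (9.4.80). The function names are illustrative and are not part of the text.

```python
import math


def contraction(mu, M):
    """Factor 1 - 4*mu + 4*mu^2*M that multiplies P_r(n) in (9.4.76)."""
    return 1.0 - 4.0 * mu + 4.0 * mu ** 2 * M


def steady_state_ratio(mu, M):
    """Residual echo ratio P_r(inf)/sigma_u^2 from (9.4.79)."""
    return mu * M / (1.0 - mu * M)


def residual_echo_ratio(n, mu, M, p0_ratio):
    """Closed-form solution (9.4.77); p0_ratio is P_r(0)/sigma_u^2."""
    s = steady_state_ratio(mu, M)
    return contraction(mu, M) ** n * (p0_ratio - s) + s


def iterations_for_reduction(reduction_db, mu, M):
    """Iterations n for which (9.4.80) gives the requested echo reduction."""
    target = 10.0 ** (-reduction_db / 10.0)
    return math.log(target) / math.log(contraction(mu, M))
```

For small $\mu$ the value returned by `iterations_for_reduction(20, mu, M)` should be close to the rule of thumb $1.15/\mu$ quoted in the text; the small difference comes from the $4\mu^2 M$ term that the rule of thumb neglects.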


EXAMPLE 9.4.2. Consider the system shown in Figure 9.23 for investigating the performance of the LMS algorithm in adaptive 
echo cancelation and to verify the above conclusions. The data generators A (in modem A) and B (in modem B) output symbols +1 
or —1 with equal probability (i.e., Bernoulli sequence). The FIR filter following data generator A models the echo path, which is 


assumed to be 
cin) =-5{ 5] +(4] O<n<M -1 
3\2 3\5 


where M = 20 is the total length of the echo path. The filter following data generator B models the transmission path between 
the far-end transmitter and the near-end receiver, which we will assume to be 


$$ g(n) = \frac{4}{5}\left(\frac{3}{5}\right)^n \qquad n \ge 0 $$

The noise generator is a white Gaussian source with $\sigma_v^2 = 1$ and models the transmission noise. Using the equations

$$ \sigma_y^2 = \sum_{k=0}^{M-1} c_o^2(k) \qquad \text{and} \qquad \sigma_u^2 = \sum_{k=0}^{\infty} g^2(k) + \sigma_v^2 $$

we scale $u(n)$ so that $10 \log(\sigma_y^2/\sigma_u^2) = 30$ dB. The
adaptive echo canceler employs the LMS algorithm with $\mathbf{c}(0) = \mathbf{0}$. We perform Monte Carlo simulations on this system. Figure
9.24 shows the residual echo ratio $P_r(n)/\sigma_u^2$ evaluated by ensemble averaging over 200 independent trials of the experiment, for
two different step sizes in the LMS algorithm [which satisfy (9.4.78)], superimposed on the corresponding theoretical curves
computed by using (9.4.79) and (9.4.80). Clearly, the simulations support the theoretical results quite accurately. More detailed
discussions of adaptive echo cancelation techniques for both baseband and passband data transmission systems can be found in 
Gitlin et al. (1992) and in Ling (1993a). 
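An experiment of this kind can be sketched in a few lines of Python. The sketch below is not the simulation used for Figure 9.24: the echo path is a hypothetical decaying exponential scaled to give a 30-dB echo-to-signal ratio, and the uncancelable signal $u(n)$ is simply modeled as white Gaussian noise of unit variance instead of the filtered far-end data plus noise. The function name and all default parameter values are assumptions made for illustration.

```python
import numpy as np


def simulate_residual_echo_ratio(mu, M=20, n_iter=1500, n_trials=200,
                                 echo_to_signal_db=30.0, seed=0):
    """Ensemble-averaged residual echo ratio P_r(n)/sigma_u^2 of an LMS echo canceler."""
    rng = np.random.default_rng(seed)
    sigma_u2 = 1.0
    # Hypothetical echo path, scaled so that sum(c_o^2)/sigma_u^2 matches the requested ratio.
    c_o = 0.8 ** np.arange(M)
    c_o *= np.sqrt(10.0 ** (echo_to_signal_db / 10.0) * sigma_u2 / np.sum(c_o ** 2))
    # IID binary data (+1/-1 with equal probability), one row per trial.
    a = rng.choice([-1.0, 1.0], size=(n_trials, n_iter + M - 1))
    # Simplifying assumption: u(n) is white Gaussian noise.
    u = np.sqrt(sigma_u2) * rng.standard_normal((n_trials, n_iter))
    c = np.zeros((n_trials, M))
    ratio = np.empty(n_iter)
    for n in range(n_iter):
        # Squared norm of the coefficient error before the update at time n.
        ratio[n] = np.mean(np.sum((c - c_o) ** 2, axis=1)) / sigma_u2
        x = a[:, n:n + M][:, ::-1]            # data vector, most recent sample first
        s_r = x @ c_o + u[:, n]               # received signal: echo plus uncancelable part
        e = s_r - np.sum(c * x, axis=1)       # error formed from the received signal
        c += 2.0 * mu * e[:, None] * x        # LMS update (9.4.73), all trials at once
    return ratio
```

Plotting `10 * np.log10(ratio)` for two step sizes that satisfy (9.4.78), together with the curve predicted by (9.4.77), should show the tradeoff discussed above: the larger step size converges faster but settles at a higher residual echo.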


Adaptive equalization 
When data are transmitted below 2400 bits/s, the ISI is relatively small and does not pose a problem in the
operation of a modem. However, for high-speed communication over 2400 bits/s, an equalizer is needed in the
modem to compensate for the channel distortion. Since channel characteristics are generally unknown and
time-varying, an adaptive algorithm is required, which leads to adaptive equalization. Figure 9.25 describes an
application of adaptive filtering to adaptive channel equalization. Initially, the coefficients of the equalizer are adjusted,
by means of the LMS algorithm, by transmitting a known training sequence of short duration. After this short training
period, the actual data sequence $\{y(n)\}$ is transmitted. The slow variation in channel characteristics is then
continuously tracked by adjusting the coefficients of the equalizer, using the decisions in place of the known training
sequence. This approach works well when decision errors are infrequent.


FIGURE 9.24
Performance analysis of the LMS algorithm in the adaptive echo cancelation that clearly shows the tradeoff between rate of
convergence and residual echo power. [Vertical axis: residual echo ratio.]





FIGURE 9.25
Model of an adaptive equalizer in a data transmission system. [Panel (a) block labels: data sequence, transmitter, channel, equalizer, receiver with detector, received data. Panel (b): labels not recoverable.]


EXAMPLE 9.4.3. Figure 9.26 shows the block diagram of the system used in the experimental investigation of the performance of 
the LMS algorithm used in the adaptive equalizer. The data source generates Bernoulli sequence { y(n)} with symbols +1 and —1 
having zero mean and unit variance. The channel following the source is modeled by the raised cosine impulse response 


$$ h(n) = \begin{cases} 0.5\left\{1 + \cos\left[\dfrac{2\pi}{W}(n-2)\right]\right\} & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases} \tag{9.4.81} $$


where parameter W is used to control the amount of channel distortion. The amount of channel distortion increases with W. The 
random noise generator outputs a white Gaussian sequence $v(n)$, which models the noise in the channel. The equalizer input is

$$ x(n) = \sum_{k=1}^{3} h(k)\, y(n-k) + v(n) \tag{9.4.82} $$

Since $y(n)$ is an independent sequence and since $v(n)$ is uncorrelated with $y(n)$, the maximum lag that produces nonzero
correlation is 2. Thus the correlation of $x(n)$ is given by

$$ r_x(0) = h^2(1) + h^2(2) + h^2(3) + \sigma_v^2 $$
$$ r_x(1) = h(1)h(2) + h(2)h(3) $$
$$ r_x(2) = h(1)h(3) $$

from which an $M \times M$ autocorrelation matrix $\mathbf{R}$ can be constructed for an equalizer of length $M$. Clearly, parameter $W$ also
controls the eigenvalues of $\mathbf{R}$ and hence the ratio $\chi(\mathbf{R})$. Here we study the performance of the corresponding LMS adaptive
equalizer.

FIGURE 9.26
Block diagram of a system for investigating the performance of an adaptive equalizer. [Signal labels: $\hat{y}(n)$, $y(n - D)$.]
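The dependence of the eigenvalue spread on $W$ can be examined directly from (9.4.81) and the correlation values above. The following Python sketch builds the Toeplitz matrix $\mathbf{R}$ and computes $\chi(\mathbf{R})$ as the ratio of its largest to its smallest eigenvalue; the function names are illustrative, and the code assumes $M \ge 3$.

```python
import numpy as np


def raised_cosine_channel(W):
    """Channel samples h(1), h(2), h(3) from (9.4.81)."""
    n = np.arange(1, 4)
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * (n - 2) / W))


def equalizer_input_correlation(W, M=11, sigma_v2=0.001):
    """M x M Toeplitz correlation matrix of the equalizer input x(n); requires M >= 3."""
    h = raised_cosine_channel(W)
    r = np.zeros(M)
    r[0] = np.sum(h ** 2) + sigma_v2
    r[1] = h[0] * h[1] + h[1] * h[2]
    r[2] = h[0] * h[2]
    lags = np.abs(np.arange(M)[:, None] - np.arange(M)[None, :])
    return r[lags]          # entry (i, j) holds r_x(|i - j|)


def eigenvalue_spread(R):
    """Ratio of the largest to the smallest eigenvalue of a symmetric matrix."""
    eigenvalues = np.linalg.eigvalsh(R)   # returned in ascending order
    return eigenvalues[-1] / eigenvalues[0]
```

Evaluating `eigenvalue_spread(equalizer_input_correlation(W))` for $W = 2.9$ and $W = 3.5$ shows the spread growing with $W$, which is the effect studied in the example.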


The training signal $y(n)$ is delayed by an amount equal to the combined delay introduced by the channel and the equalizer for
the desired signal. The impulse response $h(n)$ in (9.4.81) is symmetric with respect to $n = 2$, and assuming that the equalizer is
a linear-phase FIR filter, the total delay is equal to $\Delta = (M-1)/2 + 2$. The error signal $e(n) = y(n - \Delta) - \hat{y}(n)$ is used along
with $x(n)$ to implement the LMS algorithm in the adaptive equalizer with $\mathbf{c}(0) = \mathbf{0}$. We performed Monte Carlo simulations
using 100 realizations of random sequences with $M = 11$; $\Delta = 7$; $\sigma_v^2 = 0.001$; $W = 2.9$ and $W = 3.5$; and $\mu = 0.01$, 0.04,
and 0.08. The results are shown in Figures 9.27 and 9.28.

FIGURE 9.27
Performance analysis curves of the LMS algorithm in the adaptive equalizer: $\mu = 0.04$.

FIGURE 9.28
MSE learning curves of the LMS algorithm in the adaptive equalizer: $W = 2.9$.


Effect of eigenvalue spread. Performance plots of the LMS algorithm for $W = 2.9$ and $W = 3.5$ are shown in Figure
9.27. In plot (a) we depict MSE learning curves, from which we observe that the convergence rate of the MSE decreases with $W$ [or
equivalently with an increase in $\chi(\mathbf{R})$], which is to be expected. The steady-state error, on the other hand, increases with $W$. In plots
(b) and (c) we show the ensemble-averaged equalizer coefficients. Clearly, the responses are symmetric with respect to $n = 5$, as
assumed. Also, the equalizer coefficients converge to different inverses because of the changes in the channel characteristics.


Effect of step size $\mu$. In Figure 9.28 we show the MSE learning curves obtained for $W = 2.9$ with three different
step-size parameter values of 0.01, 0.04, and 0.08. The figure indicates that $\mu$ affects the rate of convergence as well as the steady-state
value. For $\mu = 0.08$, the algorithm converges in about 100 iterations but has a higher steady-state value than for
$\mu = 0.04$, which requires about 275 iterations for convergence. For $\mu = 0.01$ more than 500 iterations are needed. Finally,
Figure 9.29 shows sample realizations of the transmitted, received, and equalized sequences using the discussed LMS equalizer.
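The equalizer experiment can be reproduced approximately with the following Python sketch, which uses the channel (9.4.81), the input model (9.4.82), and the LMS update with gain $2\mu$. It is a sketch under stated assumptions and not the code behind Figures 9.27 and 9.28: the function name, the random seed, and the zero initial conditions for samples before $n = 0$ are choices made here for illustration.

```python
import numpy as np


def lms_equalizer_mse(W, mu, M=11, delay=7, sigma_v2=0.001,
                      n_iter=500, n_trials=100, seed=0):
    """Ensemble-averaged squared error e^2(n) of an LMS adaptive equalizer."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, 4)
    h = np.concatenate(([0.0], 0.5 * (1.0 + np.cos(2.0 * np.pi * (k - 2) / W))))
    y = rng.choice([-1.0, 1.0], size=(n_trials, n_iter))     # Bernoulli data
    # Channel output x(n) = sum_k h(k) y(n-k) + v(n), taking y(n) = 0 for n < 0.
    x = np.zeros((n_trials, n_iter))
    for lag in range(1, 4):
        x[:, lag:] += h[lag] * y[:, :-lag]
    x += np.sqrt(sigma_v2) * rng.standard_normal((n_trials, n_iter))
    # Prepend M-1 zeros so that every data vector has full length.
    xp = np.hstack([np.zeros((n_trials, M - 1)), x])
    c = np.zeros((n_trials, M))
    mse = np.zeros(n_iter)
    for n in range(n_iter):
        xn = xp[:, n:n + M][:, ::-1]                  # [x(n), x(n-1), ..., x(n-M+1)]
        d = y[:, n - delay] if n >= delay else np.zeros(n_trials)   # delayed training signal
        e = d - np.sum(c * xn, axis=1)
        c += 2.0 * mu * e[:, None] * xn               # LMS coefficient update
        mse[n] = np.mean(e ** 2)
    return mse
```

Calling the function with $W = 2.9$ and $\mu = 0.01$, 0.04, and 0.08 and plotting the results on a logarithmic scale gives learning curves that can be compared qualitatively with the behavior described above.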


FIGURE 9.29
Sample realizations of the transmitted, received, and equalized sequences using an FIR LMS equalizer. [Panels: (a) transmitted sequence, (b) received sequence, (c) equalized sequence; horizontal axis: $n$ from 0 to 500.]


9.4.5 Some Practical Considerations


The LMS is the most widely known and used adaptive algorithm because of its simplicity and robustness to 
disturbances and model errors. We next discuss some issues related to its robustness, finite-word-length effects, and 
implementation. 


Robustness 

If we assume the model in Figure 9.19, an adaptive filter is said to be robust if the effect of the disturbances
$\{\tilde{\mathbf{c}}(-1), e_o(n)\}$ on the resulting estimation errors $\{\tilde{\mathbf{c}}(n), e(n)\}$ (or $\{\tilde{\mathbf{c}}(n), \varepsilon(n)\}$), as measured by their energy, is
small (Sayed and Rupp 1998). Basically, a robust adaptive filter should be insensitive to the initial conditions $\tilde{\mathbf{c}}(-1)$
and the optimum residual error $e_o(n)$, which acts as measurement noise. These inputs are collectively called
disturbances. In practice, $e_o(n)$ accounts not only for measurement noise but also for model mismatching,
quantization errors, and other inaccuracies.

If we define the energies of the disturbances and the estimation errors by 


$$ E_{\text{dist}}(n) \triangleq \frac{1}{2\mu}\,\|\tilde{\mathbf{c}}(-1)\|^2 + \sum_{j=0}^{n} |e_o(j)|^2 \tag{9.4.83} $$

and

$$ E_{\text{error}}(n) \triangleq \frac{1}{2\mu}\,\|\tilde{\mathbf{c}}(n)\|^2 + \sum_{j=0}^{n} |e(j)|^2 \tag{9.4.84} $$

it can be shown that the coefficient vectors determined by the LMS algorithm satisfy the condition

$$ E_{\text{error}}(n) \le E_{\text{dist}}(n) \tag{9.4.85} $$

assuming that $0 < 2\mu < 1/\|\mathbf{x}(n)\|^2$ (Sayed and Kailath 1994; Sayed and Rupp 1996). Equation (9.4.85) shows
that the energy of the residuals is always upper-bounded by the energy of the disturbances, which explains the robust
behavior of the LMS algorithm.

Furthermore, it can be shown that the LMS algorithm minimizes the maximum possible difference between
these two energies, over all disturbances with finite energy, and is optimum according to the $H^\infty$ (or minimax)
criterion (Sayed and Rupp 1998; Hassibi et al. 1996).


Finite-precision effects 

When we design an LMS adaptive filter for a stationary SOE, we choose the step size $\mu$ to provide the desired
balance between speed of convergence and misadjustment. If we are not concerned about fast convergence, we can
reduce $\mu$ so much as to obtain practically insignificant misadjustment. However, in a digital implementation, the
adaptation of the LMS algorithm stops (stalls) when the correction term becomes smaller in magnitude than one-half
of the least significant bit (LSB), that is,

$$ |2\mu\, e^*(n)\, x(n-k)| \le \frac{\text{LSB}}{2} \tag{9.4.86} $$

Therefore, a decrease in $\mu$ may result in a performance degradation, unless we increase the number of bits (i.e., the
precision) of the filter coefficients. If $x_{\text{rms}}$ is the root mean square (rms) amplitude of the input signal, to a good
approximation we have

$$ |e(n)| \le \frac{\text{LSB}}{4\mu\, x_{\text{rms}}} \triangleq \text{DRE} \tag{9.4.87} $$

where DRE is known as the digital residual error (Gitlin et al. 1973). We note that for a given number of bits the
DRE increases as we decrease the step size $\mu$.
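To get a feeling for the numbers in (9.4.87), the following Python sketch evaluates the DRE for a given step size and word length. The mapping from word length to LSB used here, $\text{LSB} = 2^{-(B-1)}$ for signed $B$-bit coefficients in the range $[-1, 1)$, is an assumption made for illustration and is not specified in the text.

```python
def digital_residual_error(mu, x_rms, n_bits):
    """Digital residual error (9.4.87) for signed fixed-point coefficients.

    Assumption: coefficients lie in [-1, 1) and use n_bits bits including
    the sign, so that LSB = 2**-(n_bits - 1).
    """
    lsb = 2.0 ** (-(n_bits - 1))
    return lsb / (4.0 * mu * x_rms)
```

Halving the step size doubles the DRE, while each additional coefficient bit halves it, so one extra bit compensates for one halving of $\mu$ under this model.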

The roundoff numerical errors contribute to the steady-state EMSE a term that is inversely proportional to $\mu$,
whereas the quantization of the input data and the filter output contributes a second term that is independent of the
step size (Caraiscos and Liu 1984). Hence, in practice the step size of the LMS algorithm cannot be decreased below
the level where the degradation effects of quantization and finite-precision arithmetic become significant. Also, the
finite-precision effects become more pronounced as the ill conditioning of the input increases (Alexander 1987).

When one or more eigenvalues of the input correlation matrix are zero, the corresponding adaptation modes
either do not converge or may result in overflow due to nonlinear quantization effects (Gitlin et al. 1982). These
effects can be prevented by using a technique known as leakage. The leaky LMS algorithm is given by

$$ \mathbf{c}(n) = (1 - \gamma\mu)\,\mathbf{c}(n-1) + \mu\, e^*(n)\,\mathbf{x}(n) \tag{9.4.88} $$

where $\gamma$ is the leakage coefficient. Since $\mu$ and $\gamma$ are very small positive constants, $1 - \gamma\mu$ is slightly less
than 1. The updating (9.4.88) is obtained by minimizing the cost function

$$ P(n) = |e(n)|^2 + \gamma\,\|\mathbf{c}(n)\|^2 \tag{9.4.89} $$

which includes a penalty term proportional to the size of the coefficient vector. The price of leakage is an increase in
computational complexity and some bias in the obtained estimates (see Problem 9.17). More details and practical
applications of the leaky LMS algorithm to adaptive equalization are discussed in Gitlin et al. (1982, 1992).
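One step of the leaky update (9.4.88) can be written as the following Python sketch. The function name and argument order are illustrative; the error is taken to be the a priori error $e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n)$.

```python
import numpy as np


def leaky_lms_step(c, x, y, mu, gamma):
    """One leaky LMS update (9.4.88); returns the new coefficients and the error."""
    e = y - np.vdot(c, x)                     # np.vdot conjugates c, giving c^H x
    c_new = (1.0 - gamma * mu) * c + mu * np.conj(e) * x
    return c_new, e
```

With the input switched off ($\mathbf{x}(n) = \mathbf{0}$) the coefficients shrink by the factor $1 - \gamma\mu$ at every step instead of staying frozen, which is the mechanism that keeps undriven modes from drifting.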

We can simplify the hardware implementation of LMS adaptive filters by using nonlinearities to avoid the
multiplications involved in the updating of the filter coefficients. These simplified LMS algorithms update the filter
coefficients by using quantized correction terms such as $\mu\,\mathrm{sign}\{e(n)\}\,x(n-k)$, $\mu\, e(n)\,\mathrm{sign}\{x(n-k)\}$, or
$\mu\,\mathrm{sign}\{e(n)\,x(n-k)\}$, and their performance is degraded by the lower precision. Various signum-based LMS
adaptive algorithms are discussed in Claasen and Mecklenbrauker (1981), Duttweiler (1982), and Treichler et al.
(1987).


Transform-domain and block LMS algorithms 

The LMS algorithm attains its best rate of convergence when the input correlation matrix is diagonal with equal 
eigenvalues. In the case of FIR filters, this implies that the input signal is white noise. When the components of the 
input data vector are correlated, we can improve the convergence by using an isotropic decorrelating transformation, 
as shown in Figure 9.30. The transformation matrix can be obtained by using either the triangular or the orthogonal 
decomposition of the input correlation matrix as explained in Section 2.3. Since the innovations vector used by the 
LMS algorithm has uncorrelated components with unit variance, the error performance surface is a hypersphere, and 
the transform-domain LMS algorithm attains its best rate of convergence. In practice, when the input correlation 
matrix is unknown and possibly time-varying, we can only use suboptimum transforms such as the DFT, the discrete 
cosine transform (DCT), the discrete wavelet transform (DWT), or some other orthogonal transform. The 
performance of the obtained adaptive filter depends on the decorrelation properties of the transform, which in turn
depend on the properties of the input correlation matrix. Another approach to overcome the problem of slow
convergence for highly correlated inputs is found in the family of affine projection algorithms discussed in Ozeki and 
Umeda (1984), Rupp (1995), and Morgan and Kratzer (1996) and the references therein. 





FIGURE 9.30 
Transform domain LMS adaptive filter structure. 


In applications that require adaptive filters with a very large number of coefficients, real-time implementation of 
the LMS algorithm becomes quite involved. For example, acoustic echo cancelers with 8000 coefficients (500 ms 
sampled at 16 kHz) are typical for teleconference applications (Gilloire et al. 1996). The complexity of such 
applications can be reduced by using block adaptive filters (see Figure 9.31) that process one block of data at a time 
in either the time or the frequency domain. The adaptive filter coefficients are updated once per block and are kept
fixed within the block. Such filters have good numerical accuracy and can be easily pipelined and parallelized, and
their complexity can be reduced by computing the involved convolutions and correlations using FFT algorithms. In
some applications, such as acoustic echo cancelation, the block-length delay introduced by these filters may create
problems. A detailed treatment of block and frequency-domain LMS algorithms is given in Shynk (1992), Gilloire et
al. (1996), Haykin (1996), Jenkins and Marshall (1998), and Treichler et al. (1987).

FIGURE 9.31
Block adaptive filter structure.


Another approach to reduce complexity and improve convergence is subband adaptive filtering, which splits the 
input signal and the desired response into smaller frequency bands (subbands), subsamples the resulting signals, 
processes each subband with different LMS filters, and finally interpolates and recombines the subbands to obtain the 
filter output (Shynk 1992; Gilloire and Vetterli 1992). The improved convergence results because the spectral 
dynamic range of each subband is smaller than that of the full band. However, the performance of subband adaptive 
filters is degraded by the cross-talk between adjacent subbands. 


9.5 Recursive Least-Squares Adaptive Filters 


In this section we use the method of LS to develop adaptive filters, we determine their rate of convergence and 
misadjustment, and we introduce the conventional recursive least-squares (CRLS) algorithm for their implementation. 
The CRLS algorithm does not impose any restrictions on the input data vector; therefore, it can be used for both array 
processing and FIR filtering applications. 


9.5.1 LS Adaptive Filters 


LS adaptive filters are designed so that the updating of their coefficients always attains the minimization of the total 
squared error from the time the filter initiated operation up to the current time. Therefore, the filter coefficients at time 
index n are chosen to minimize the cost function 


$$ E(n) = \sum_{j=0}^{n} \lambda^{n-j}\,|e(j)|^2 = \sum_{j=0}^{n} \lambda^{n-j}\,|y(j) - \mathbf{c}^H \mathbf{x}(j)|^2 \tag{9.5.1} $$

where $e(j)$ is the instantaneous error and the constant $\lambda$, $0 < \lambda \le 1$, is the forgetting factor. Note that since the
filter coefficients are held constant during the observation interval $0 \le j \le n$, the a priori and a posteriori errors
are identical. The coefficient vector obtained by minimizing (9.5.1) is denoted by $\mathbf{c}(n)$ and provides the optimum
LSE filter at time $n$. When $\lambda = 1$, we say that the algorithm has growing memory because the values of the filter
coefficients are a function of all the past input values. The forgetting factor (see Figure 9.32) is used to ensure that
data in the distant past are paid less attention ("forgotten") in order to provide the filter with tracking capability when
it operates in a varying SOE (see Section 9.8).
The filter coefficients that minimize the total squared error (9.5.1) are specified by the normal equations

$$ \hat{\mathbf{R}}(n)\,\mathbf{c}(n) = \hat{\mathbf{d}}(n) \tag{9.5.2} $$

where

$$ \hat{\mathbf{R}}(n) \triangleq \sum_{j=0}^{n} \lambda^{n-j}\,\mathbf{x}(j)\mathbf{x}^H(j) \tag{9.5.3} $$

and

$$ \hat{\mathbf{d}}(n) \triangleq \sum_{j=0}^{n} \lambda^{n-j}\,\mathbf{x}(j)\,y^*(j) \tag{9.5.4} $$

provide exponentially weighted estimates of the input correlation matrix and the cross-correlation vector between input
and desired response, due to the presence of $\lambda^{n-j}$ in the cost function (9.5.1). The minimum total squared error is

$$ E_{\min}(n) = E_y(n) - \hat{\mathbf{d}}^H(n)\,\mathbf{c}(n) \tag{9.5.5} $$

where

$$ E_y(n) \triangleq \sum_{j=0}^{n} \lambda^{n-j}\,|y(j)|^2 \tag{9.5.6} $$

is the energy of the weighted desired response signal. These formulas have been derived in Section 7.2.1.

FIGURE 9.32
Exponential weighting of observations at times $n$ and $n+1$. Older data are more heavily discounted by the algorithm.

Suppose now that we wait for some $n > M$, where $\hat{\mathbf{R}}(n)$ is usually nonsingular, compute $\hat{\mathbf{R}}(n)$ and
$\hat{\mathbf{d}}(n)$, and then solve the normal equations (9.5.2) to determine the filter coefficients $\mathbf{c}(n)$. This approach,
which is time-consuming, must be repeated with the arrival of each new pair of observations $\{\mathbf{x}(n), y(n)\}$, that is, at
times $n+1$, $n+2$, etc.

A first reduction in computational complexity can be obtained by noticing that (9.5.3) can be expressed as

$$ \hat{\mathbf{R}}(n) = \lambda\,\hat{\mathbf{R}}(n-1) + \mathbf{x}(n)\mathbf{x}^H(n) \tag{9.5.7} $$

which shows that the "new" correlation matrix $\hat{\mathbf{R}}(n)$ can be updated by weighting the "old" correlation matrix
$\hat{\mathbf{R}}(n-1)$ with the forgetting factor $\lambda$ and then incorporating the "new information" $\mathbf{x}(n)\mathbf{x}^H(n)$. Since the outer
product $\mathbf{x}(n)\mathbf{x}^H(n)$ is a matrix of rank 1, (9.5.7) provides a rank 1 modification of the correlation matrix. Similarly,
using (9.5.4), we can show that

$$ \hat{\mathbf{d}}(n) = \lambda\,\hat{\mathbf{d}}(n-1) + \mathbf{x}(n)\,y^*(n) \tag{9.5.8} $$

which provides a time update of the cross-correlation vector.

We next show that using these two updatings, we can determine the new coefficient vector $\mathbf{c}(n)$ from the old
coefficient vector $\mathbf{c}(n-1)$ and the new observation pair $\{\mathbf{x}(n), y(n)\}$ without solving the normal equations (9.5.2)
from scratch.
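The equivalence between the time updates (9.5.7) and (9.5.8) and the direct sums (9.5.3) and (9.5.4) is easy to confirm numerically. In the following Python sketch the rows of the array `X` hold the data vectors $\mathbf{x}(0), \ldots, \mathbf{x}(n)$; the function names and this storage layout are assumptions made for illustration.

```python
import numpy as np


def direct_estimates(X, y, lam):
    """Exponentially weighted sums (9.5.3) and (9.5.4); row j of X is x(j)."""
    n = X.shape[0] - 1
    w = lam ** (n - np.arange(n + 1))         # weights lambda^(n-j)
    R = (X.T * w) @ X.conj()                  # sum_j w_j x(j) x^H(j)
    d = (X.T * w) @ y.conj()                  # sum_j w_j x(j) y*(j)
    return R, d


def time_update(R_prev, d_prev, x, y_n, lam):
    """Rank 1 time updates (9.5.7) and (9.5.8)."""
    R = lam * R_prev + np.outer(x, x.conj())
    d = lam * d_prev + x * np.conj(y_n)
    return R, d
```

Starting from zero and applying `time_update` once per observation reproduces the output of `direct_estimates`, at a cost of $O(M^2)$ operations per step instead of recomputing the full sums.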


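A priori adaptive LS algorithm. If we solve (9.5.7) for $\hat{\mathbf{R}}(n-1)$ and (9.5.8) for $\hat{\mathbf{d}}(n-1)$ and use the normal
equations (9.5.2) at time $n-1$, we have

$$ [\hat{\mathbf{R}}(n) - \mathbf{x}(n)\mathbf{x}^H(n)]\,\mathbf{c}(n-1) = \hat{\mathbf{d}}(n) - \mathbf{x}(n)\,y^*(n) $$

or after some simple manipulations

$$ \hat{\mathbf{R}}(n)\,\mathbf{c}(n-1) + \mathbf{x}(n)\,e^*(n) = \hat{\mathbf{d}}(n) \tag{9.5.9} $$

where

$$ e(n) = y(n) - \mathbf{c}^H(n-1)\,\mathbf{x}(n) \tag{9.5.10} $$

is the a priori estimation error. If the matrix $\hat{\mathbf{R}}(n)$ is invertible, by multiplying both sides of (9.5.9) by $\hat{\mathbf{R}}^{-1}(n)$ and
using (9.5.2), we obtain

$$ \mathbf{c}(n-1) + \hat{\mathbf{R}}^{-1}(n)\,\mathbf{x}(n)\,e^*(n) = \hat{\mathbf{R}}^{-1}(n)\,\hat{\mathbf{d}}(n) = \mathbf{c}(n) \tag{9.5.11} $$

If we define the adaptation gain vector $\mathbf{g}(n)$ by

$$ \hat{\mathbf{R}}(n)\,\mathbf{g}(n) \triangleq \mathbf{x}(n) \tag{9.5.12} $$

Equation (9.5.11) can be written as

$$ \mathbf{c}(n) = \mathbf{c}(n-1) + \mathbf{g}(n)\,e^*(n) \tag{9.5.13} $$

which shows how to update the old coefficient vector $\mathbf{c}(n-1)$ to obtain the current vector $\mathbf{c}(n)$.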


EXAMPLE 9.5.1. It is instructive at this point to derive the LS adaptive filter with a single coefficient. Indeed, since for $M = 1$ the
correlation matrix $\hat{\mathbf{R}}(n)$ becomes the scalar $E_x(n)$, we obtain

$$ E_x(n) = \lambda E_x(n-1) + |x(n)|^2 $$
$$ e(n) = y(n) - c^*(n-1)\,x(n) $$
$$ c(n) = c(n-1) + \frac{1}{E_x(n)}\,x(n)\,e^*(n) $$

which is like an LMS algorithm with time-varying gain $\mu(n) = 1/E_x(n)$. However, the present algorithm is optimum in the LS
sense.
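The single-coefficient recursion of this example translates almost line by line into code. The following Python sketch assumes the initial values $c(-1) = 0$ and $E_x(-1) = 0$ and skips the update while $E_x(n) = 0$; these initial conditions and the function name are choices made here, not part of the example.

```python
import numpy as np


def scalar_ls_filter(x, y, lam=1.0):
    """Single-coefficient LS adaptive filter; returns c(n) for every n."""
    c = 0.0
    energy = 0.0                                   # E_x(n), the scalar correlation
    history = []
    for x_n, y_n in zip(x, y):
        energy = lam * energy + abs(x_n) ** 2      # E_x(n) = lam E_x(n-1) + |x(n)|^2
        e = y_n - np.conj(c) * x_n                 # a priori error
        if energy > 0.0:
            c = c + x_n * np.conj(e) / energy      # gain 1/E_x(n) instead of a fixed step
        history.append(c)
    return np.array(history)
```

With $\lambda = 1$ the last value of the recursion equals the batch LS solution $\sum_j x(j)y^*(j) / \sum_j |x(j)|^2$, which illustrates the sense in which the algorithm is optimum.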


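A posteriori adaptive LS algorithm. If we substitute (9.5.7) and (9.5.8) into the normal equations (9.5.2), after
some simple manipulations, we obtain

$$ \lambda\,\hat{\mathbf{R}}(n-1)\,\mathbf{c}(n) - \mathbf{x}(n)\,\varepsilon^*(n) = \lambda\,\hat{\mathbf{d}}(n-1) \tag{9.5.14} $$

where

$$ \varepsilon(n) = y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n) \tag{9.5.15} $$

is the a posteriori estimation error. If the matrix $\hat{\mathbf{R}}(n-1)$ is invertible, (9.5.14) gives

$$ \mathbf{c}(n) - \lambda^{-1}\hat{\mathbf{R}}^{-1}(n-1)\,\mathbf{x}(n)\,\varepsilon^*(n) = \hat{\mathbf{R}}^{-1}(n-1)\,\hat{\mathbf{d}}(n-1) = \mathbf{c}(n-1) $$

or

$$ \mathbf{c}(n) = \mathbf{c}(n-1) + \bar{\mathbf{g}}(n)\,\varepsilon^*(n) \tag{9.5.16} $$

where

$$ \lambda\,\hat{\mathbf{R}}(n-1)\,\bar{\mathbf{g}}(n) \triangleq \mathbf{x}(n) \tag{9.5.17} $$

determines the alternative adaptation gain vector $\bar{\mathbf{g}}(n)$.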
Since recursions (9.5.15) and (9.5.16) are coupled, the a posteriori algorithm is not directly applicable. However, if we
substitute (9.5.16) into (9.5.15), we obtain

$$ \varepsilon(n) = y(n) - [\mathbf{c}^H(n-1) + \varepsilon(n)\,\bar{\mathbf{g}}^H(n)]\,\mathbf{x}(n) = e(n) - \varepsilon(n)\,\bar{\mathbf{g}}^H(n)\,\mathbf{x}(n) $$

or

$$ \varepsilon(n) = \frac{e(n)}{\bar{\alpha}(n)} \tag{9.5.18} $$

where

$$ \bar{\alpha}(n) \triangleq 1 + \bar{\mathbf{g}}^H(n)\,\mathbf{x}(n) = 1 + \lambda^{-1}\,\mathbf{x}^H(n)\,\hat{\mathbf{R}}^{-1}(n-1)\,\mathbf{x}(n) \tag{9.5.19} $$

is known as the conversion factor. Hence, we can use (9.5.19) and (9.5.18) to compute the a posteriori error $\varepsilon(n)$
before we update the filter coefficient vector. This trick makes possible the realization and use of the a posteriori LS
adaptive filter algorithm. If $\hat{\mathbf{R}}(n-1)$ is positive definite, we have $\bar{\alpha}(n) > 1$ and $|\varepsilon(n)| < |e(n)|$ for all $n$.
Therefore,

$$ \sum_{j=0}^{n} \lambda^{n-j}\,|\varepsilon(j)|^2 \le \sum_{j=0}^{n} \lambda^{n-j}\,|e(j)|^2 \tag{9.5.20} $$

which should be expected† because the adaptive filter is designed by minimizing, at each time $n$, the total squared a
posteriori error $\varepsilon(n)$.
Also, from (9.5.13), (9.5.16), and (9.5.18) we obtain

$$ \mathbf{g}(n) = \frac{\bar{\mathbf{g}}(n)}{\bar{\alpha}(n)} \tag{9.5.21} $$

which shows that the two adaptation gains have the same direction but different lengths. However, from (9.5.13) and
(9.5.16) we see that the corrections $\mathbf{g}(n)\,e^*(n)$ and $\bar{\mathbf{g}}(n)\,\varepsilon^*(n)$ are equal.
Another conversion factor, defined in terms of the gain vector $\mathbf{g}(n)$, is

$$ \alpha(n) \triangleq 1 - \mathbf{x}^H(n)\,\hat{\mathbf{R}}^{-1}(n)\,\mathbf{x}(n) = 1 - \mathbf{x}^H(n)\,\mathbf{g}(n) \tag{9.5.22} $$

and has some interesting interpretations. Using (9.5.21), we have

$$ \alpha(n) = 1 - \frac{\mathbf{x}^H(n)\,\bar{\mathbf{g}}(n)}{\bar{\alpha}(n)} $$

so that

$$ \alpha(n)\,\bar{\alpha}(n) = \bar{\alpha}(n) + 1 - [1 + \mathbf{x}^H(n)\,\bar{\mathbf{g}}(n)] = 1 $$

or

$$ \alpha(n) = \frac{1}{\bar{\alpha}(n)} \tag{9.5.23} $$

which shows that the two conversion factors are inverses of each other. Since the input correlation matrix is
nonnegative definite, that is, $\mathbf{x}^H(n)\,\hat{\mathbf{R}}^{-1}(n)\,\mathbf{x}(n) \ge 0$, (9.5.22) implies

$$ 0 < \alpha(n) \le 1 \tag{9.5.24} $$

that is, the conversion factor $\alpha(n)$ is bounded by 0 and 1. This bound allows the interpretation of $\alpha(n)$ as an
angle variable (Lee et al. 1981), and its monitoring can provide information about the proper operation of RLS
algorithms. Also the quantity $1 - \alpha(n)$ can be interpreted as a likelihood variable (Lee et al. 1981). It can be shown
(see Problem 9.23) that

$$ \alpha(n) = \lambda^{M}\,\frac{\det \hat{\mathbf{R}}(n-1)}{\det \hat{\mathbf{R}}(n)} \tag{9.5.25} $$

which shows the importance of $\alpha(n)$ or $\bar{\alpha}(n)$ for the invertibility of the estimated correlation matrix.
The computational organization of the a priori and a posteriori LS adaptive algorithms is summarized in Table
9.5.
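The relations among the two gains and the two conversion factors can be verified numerically. The following Python sketch computes all four quantities from a given positive definite matrix standing in for $\hat{\mathbf{R}}(n-1)$ and a new data vector; the function name and the use of a general linear solver are choices made for illustration.

```python
import numpy as np


def conversion_factors(R_prev, x, lam):
    """Gains and conversion factors of the a priori and a posteriori LS filters."""
    R = lam * R_prev + np.outer(x, x.conj())        # rank 1 update (9.5.7)
    g = np.linalg.solve(R, x)                       # R(n) g(n) = x(n), (9.5.12)
    g_bar = np.linalg.solve(lam * R_prev, x)        # lam R(n-1) g_bar(n) = x(n), (9.5.17)
    alpha = 1.0 - np.vdot(g, x).real                # (9.5.22); the product is real
    alpha_bar = 1.0 + np.vdot(g_bar, x).real        # (9.5.19)
    return R, g, g_bar, alpha, alpha_bar
```

For any positive definite `R_prev` the returned values should satisfy $\alpha(n)\bar{\alpha}(n) = 1$, $\mathbf{g}(n) = \bar{\mathbf{g}}(n)/\bar{\alpha}(n)$, and the determinant relation (9.5.25), up to rounding error.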


TABLE 9.5
Summary of a priori and a posteriori LS adaptive filter approaches.

                        A priori LS adaptive filter        A posteriori LS adaptive filter

Correlation matrix      $\hat{\mathbf{R}}(n) = \lambda\hat{\mathbf{R}}(n-1) + \mathbf{x}(n)\mathbf{x}^H(n)$        $\hat{\mathbf{R}}(n) = \lambda\hat{\mathbf{R}}(n-1) + \mathbf{x}(n)\mathbf{x}^H(n)$
Adaptation gain         $\hat{\mathbf{R}}(n)\,\mathbf{g}(n) = \mathbf{x}(n)$        $\lambda\hat{\mathbf{R}}(n-1)\,\bar{\mathbf{g}}(n) = \mathbf{x}(n)$
A priori error          $e(n) = y(n) - \mathbf{c}^H(n-1)\,\mathbf{x}(n)$        $e(n) = y(n) - \mathbf{c}^H(n-1)\,\mathbf{x}(n)$
Conversion factor       $\alpha(n) = 1 - \mathbf{g}^H(n)\,\mathbf{x}(n)$        $\bar{\alpha}(n) = 1 + \bar{\mathbf{g}}^H(n)\,\mathbf{x}(n)$
A posteriori error      $\varepsilon(n) = \alpha(n)\,e(n)$        $\varepsilon(n) = e(n)/\bar{\alpha}(n)$
Coefficient updating    $\mathbf{c}(n) = \mathbf{c}(n-1) + \mathbf{g}(n)\,e^*(n)$        $\mathbf{c}(n) = \mathbf{c}(n-1) + \bar{\mathbf{g}}(n)\,\varepsilon^*(n)$


† The computation of the quantity $\sum_{j=0}^{n} \lambda^{n-j}\,|y(j) - \mathbf{c}^H \mathbf{x}(j)|^2$ for $\mathbf{c} = \mathbf{c}(n)$, $\mathbf{c}(j)$, or $\mathbf{c}(j-1)$ gives the block, a posteriori, or a priori total
squared error. Clearly, only the block filter performs optimum LS filtering for all data in the interval $0 \le j \le n$ (see Problem 10.22).

Figure 9.33 shows a block diagram representation of the a priori LS adaptive filter. There are two important
points to be made:

• The adaptation gain is strictly a function of the input signal. The desired response only affects the magnitude and
sign of the coefficient correction term through the error.

• The most demanding computational task in RLS filtering is the computation of the adaptation gain. This involves
the solution of a linear system of equations, which requires $O(M^3)$ operations per time update.

FIGURE 9.33
Basic elements of the a priori LS adaptive filter. Note that the filtering process has no effect on the computation of the gain vector. [Block labels: filtering, $\hat{y}(n) = \mathbf{c}^H(n-1)\,\mathbf{x}(n)$; coefficient updating, $\mathbf{c}(n) = \mathbf{c}(n-1) + \mathbf{g}(n)\,e^*(n)$; gain vector computation, $\hat{\mathbf{R}}(n)\,\mathbf{g}(n) = \mathbf{x}(n)$; adaptive algorithm.]


9.5.2 Conventional Recursive Least-Squares Algorithm 


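The major computational load in LS adaptive filters, that is, the computation of the gain vectors

$$ \mathbf{g}(n) = \hat{\mathbf{R}}^{-1}(n)\,\mathbf{x}(n) \tag{9.5.26} $$

or

$$ \bar{\mathbf{g}}(n) = \lambda^{-1}\hat{\mathbf{R}}^{-1}(n-1)\,\mathbf{x}(n) \tag{9.5.27} $$

can be reduced if we can find a recursive formula to update the inverse

$$ \mathbf{P}(n) \triangleq \hat{\mathbf{R}}^{-1}(n) \tag{9.5.28} $$

of the correlation matrix. We can develop such an updating by using the rank 1 updating (9.5.7) and the matrix
inversion lemma

$$ (\lambda\mathbf{R} + \mathbf{x}\mathbf{x}^H)^{-1} = \lambda^{-1}\mathbf{R}^{-1} - \frac{(\lambda^{-1}\mathbf{R}^{-1}\mathbf{x})(\lambda^{-1}\mathbf{R}^{-1}\mathbf{x})^H}{1 + \lambda^{-1}\mathbf{x}^H\mathbf{R}^{-1}\mathbf{x}} \tag{9.5.29} $$

discussed in Appendix A.
Indeed, using (9.5.29), (9.5.7), (9.5.26), and (9.5.19), we can easily show that

$$ \mathbf{P}(n) = \lambda^{-1}\mathbf{P}(n-1) - \mathbf{g}(n)\,\bar{\mathbf{g}}^H(n) \tag{9.5.30} $$

which provides the desired updating formula. Indeed, given the old matrix $\mathbf{P}(n-1)$ and the new observations
$\{\mathbf{x}(n), y(n)\}$, we compute the new matrix $\mathbf{P}(n)$ using the following procedure

$$ \begin{aligned} \bar{\mathbf{g}}(n) &= \lambda^{-1}\mathbf{P}(n-1)\,\mathbf{x}(n) \\ \bar{\alpha}(n) &= 1 + \bar{\mathbf{g}}^H(n)\,\mathbf{x}(n) \\ \mathbf{g}(n) &= \frac{\bar{\mathbf{g}}(n)}{\bar{\alpha}(n)} \\ \mathbf{P}(n) &= \lambda^{-1}\mathbf{P}(n-1) - \mathbf{g}(n)\,\bar{\mathbf{g}}^H(n) \end{aligned} \tag{9.5.31} $$

which is known as the conventional recursive LS (CRLS) algorithm. We again stress that the CRLS algorithm is valid
for both linear combiners and FIR filters because it does not make any assumptions about the nature of the input data
vector. However, for FIR filters we usually assume prewindowing, that is, $\mathbf{x}(-1) = \mathbf{0}$, or equivalently $x(n) = 0$ for
$-M \le n \le -1$.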
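The procedure (9.5.31) can be sketched in Python as follows and compared against a direct inversion of the updated correlation matrix. The function name is illustrative; the sketch assumes that `P_prev` is the Hermitian inverse of $\hat{\mathbf{R}}(n-1)$.

```python
import numpy as np


def crls_inverse_update(P_prev, x, lam):
    """One step of (9.5.31): returns P(n) and the gain g(n) from P(n-1) and x(n)."""
    g_bar = P_prev @ x / lam                        # alternative gain, (9.5.27)
    alpha_bar = 1.0 + np.vdot(g_bar, x).real        # conversion factor, (9.5.19)
    g = g_bar / alpha_bar                           # gain, (9.5.21)
    P = P_prev / lam - np.outer(g, g_bar.conj())    # inverse update, (9.5.30)
    return P, g
```

The update needs only matrix-by-vector products and one scalar division, so its cost grows as $M^2$, whereas recomputing the inverse of $\lambda\hat{\mathbf{R}}(n-1) + \mathbf{x}(n)\mathbf{x}^H(n)$ from scratch would cost on the order of $M^3$ operations.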


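Updating of the minimum total squared error. We next derive an update recursion for the minimum total
squared error (9.5.5). Using (9.5.6), we can easily see that

$$ E_y(n) = \lambda E_y(n-1) + y(n)\,y^*(n) \tag{9.5.32} $$

which provides a recursive updating for the energy of the desired response. Substituting (9.5.32) and (9.5.13) into
(9.5.5), we obtain

$$ E_{\min}(n) = \lambda E_y(n-1) + y(n)\,y^*(n) - \hat{\mathbf{d}}^H(n)\,\mathbf{c}(n-1) - \hat{\mathbf{d}}^H(n)\,\mathbf{g}(n)\,e^*(n) $$

or by using (9.5.8)

$$ E_{\min}(n) = \lambda E_y(n-1) + y(n)\,y^*(n) - \hat{\mathbf{d}}^H(n)\,\mathbf{g}(n)\,e^*(n) - y(n)\,\mathbf{x}^H(n)\,\mathbf{c}(n-1) - \lambda\,\hat{\mathbf{d}}^H(n-1)\,\mathbf{c}(n-1) $$

Rearranging the terms of the last equation and using (9.5.5), we have

$$ \begin{aligned} E_{\min}(n) &= \lambda\,[E_y(n-1) - \hat{\mathbf{d}}^H(n-1)\,\mathbf{c}(n-1)] + [y(n) - \hat{\mathbf{d}}^H(n)\,\mathbf{g}(n)]\,e^*(n) \\ &= \lambda E_{\min}(n-1) + [y(n) - \hat{\mathbf{d}}^H(n)\,\hat{\mathbf{R}}^{-1}(n)\,\hat{\mathbf{R}}(n)\,\mathbf{g}(n)]\,e^*(n) \\ &= \lambda E_{\min}(n-1) + [y(n) - \mathbf{c}^H(n)\,\mathbf{x}(n)]\,e^*(n) \end{aligned} $$

where the last equation is obtained because the matrix $\hat{\mathbf{R}}(n)$ and its inverse are Hermitian. The last equation leads
to

$$ \begin{aligned} E_{\min}(n) &= \lambda E_{\min}(n-1) + \varepsilon(n)\,e^*(n) && (9.5.33) \\ &= \lambda E_{\min}(n-1) + \alpha(n)\,|e(n)|^2 && (9.5.34) \\ &= \lambda E_{\min}(n-1) + \frac{|\varepsilon(n)|^2}{\alpha(n)} && (9.5.35) \end{aligned} $$

which provide the desired updating formulas. Since the product $\varepsilon(n)\,e^*(n)$ is by necessity real, we have
$\varepsilon(n)\,e^*(n) = \varepsilon^*(n)\,e(n)$. The value of $E_{\min}(n)$ increases with time and reaches a finite limit value only if $\lambda < 1$.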


9.5.3 Some Practical Considerations 


In the practical implementation of CRLS adaptive filters, we have to deal with the issues of computational complexity, 
initialization, and finite-word-length effects. 


Computational complexity. The complete CRLS algorithm is summarized in Table 9.6. A measure of the
computational complexity of the CRLS algorithm is provided by the number of operations (one operation consists of
one multiplication and one addition) required to perform one updating. Since $\mathbf{P}(n)$ is Hermitian, it is possible to
implement the algorithm so that it requires $2M^2 + 4M$ operations per time updating. The computation of
$\bar{\mathbf{g}}_\lambda(n) = \mathbf{P}(n-1)\,\mathbf{x}(n)$ and the updating of $\mathbf{P}(n)$ require $O(M^2)$ operations. In contrast, all remaining formulas, which involve
dot products and vector-by-scalar multiplications, require $O(M)$ operations. The inversion of the correlation matrix
$\hat{\mathbf{R}}(n)$ is essentially replaced by the scalar division used to compute $\mathbf{g}(n)$.


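TABLE 9.6
Practical implementation of the RLS algorithm. To update $\mathbf{P}(n)$, we only compute its upper (lower) triangular part and
determine the other part using Hermitian symmetry.

Initialization
    $\mathbf{c}(-1) = \mathbf{0}$        $\mathbf{P}(-1) = \delta^{-1}\mathbf{I}$
    $\delta$ = small positive constant

For each $n = 0, 1, 2, \ldots$ compute:

Adaptation gain computation
    $\bar{\mathbf{g}}_\lambda(n) = \mathbf{P}(n-1)\mathbf{x}(n)$        $\alpha_\lambda(n) = \lambda + \bar{\mathbf{g}}_\lambda^H(n)\mathbf{x}(n)$
    $\mathbf{g}(n) = \dfrac{\bar{\mathbf{g}}_\lambda(n)}{\alpha_\lambda(n)}$        $\mathbf{P}(n) = \lambda^{-1}[\mathbf{P}(n-1) - \mathbf{g}(n)\bar{\mathbf{g}}_\lambda^H(n)]$

Filtering
    $e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n)$

Coefficient updating
    $\mathbf{c}(n) = \mathbf{c}(n-1) + \mathbf{g}(n)e^*(n)$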
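The recursions of Table 9.6 map almost line by line onto array code. The following is a minimal NumPy sketch rather than a reference implementation: the function names `data_vectors` and `crls`, the default values of `lam` and `delta`, and the test signal are illustrative assumptions. The symmetrization step averages P(n) with its Hermitian transpose instead of computing only one triangular part.

```python
import numpy as np

def data_vectors(x, M):
    """Rows are x(n) = [x(n), x(n-1), ..., x(n-M+1)], with x(n) = 0 for n < 0."""
    xp = np.concatenate((np.zeros(M - 1, dtype=complex), np.asarray(x, dtype=complex)))
    return np.array([xp[n:n + M][::-1] for n in range(len(x))])

def crls(X, y, lam=0.99, delta=0.01):
    """Conventional RLS following Table 9.6. X holds one data vector per row."""
    N, M = X.shape
    c = np.zeros(M, dtype=complex)            # c(-1) = 0
    P = np.eye(M, dtype=complex) / delta      # P(-1) = I / delta
    e = np.zeros(N, dtype=complex)            # a priori errors e(n)
    for n in range(N):
        x = X[n]
        g_bar = P @ x                          # P(n-1) x(n)
        alpha = lam + np.vdot(g_bar, x).real   # lambda + g_bar^H x (real for Hermitian P)
        g = g_bar / alpha                      # adaptation gain g(n)
        P = (P - np.outer(g, g_bar.conj())) / lam
        P = 0.5 * (P + P.conj().T)             # keep P(n) Hermitian in finite precision
        e[n] = y[n] - np.vdot(c, x)            # e(n) = y(n) - c^H(n-1) x(n)
        c = c + g * np.conj(e[n])              # c(n) = c(n-1) + g(n) e*(n)
    return c, e, P

# Illustrative use: identify a hypothetical 3-tap system from noise-free data.
rng = np.random.default_rng(0)
c_true = np.array([1.0, -0.5 + 0.2j, 0.25])
x = rng.standard_normal(500) + 1j * rng.standard_normal(500)
X = data_vectors(x, M=3)
y = X @ np.conj(c_true)                        # y(n) = c_true^H x(n)
c_hat, e, P = crls(X, y)
```

The only matrix-sized work per sample is the product `P @ x` and the rank-one update of `P`, which is where the $O(M^2)$ cost arises; everything else is a dot product or a scaled vector.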











Initialization. There are two ways to obtain the values $\mathbf{P}(-1)$ and $\mathbf{c}(-1)$ required to initialize the CRLS
algorithm. The most obvious way is to collect an initial block of $n_0 > M$ data samples $\{\mathbf{x}(n), y(n)\}$ and then compute
the exact inverse matrix $\mathbf{P}(-1)$ and the exact LS solution $\mathbf{c}(-1)$.

The approach used in practice is to set $\mathbf{P}(-1) = \delta^{-1}\mathbf{I}$, where $\delta$ is a very small positive number (on the order
of $0.01\sigma_x^2$), and $\mathbf{c}(-1) = \mathbf{0}$. For FIR filters this corresponds to setting $x(-M+1) = \sqrt{\delta}$ and $x(n) = 0$ for
$-M+2 \le n \le -1$. For any $n > M$, the normal equations matrix is $\delta\lambda^{n+1}\mathbf{I} + \hat{\mathbf{R}}(n)$, which results in a biased
estimate of $\mathbf{c}(n)$. However, for large $n$ the choice of $\delta$ is unimportant because the algorithm has exponentially
forgetting memory for $\lambda < 1$.

It can be shown (see Problem 9.24) that this approach provides a set of coefficients that minimizes the modified 
cost function 


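$E(n) = \delta\lambda^{n+1}\|\mathbf{c}\|^2 + \sum_{j=0}^{n}\lambda^{n-j}|y(j) - \mathbf{c}^H\mathbf{x}(j)|^2$ (9.5.36)

instead of (9.5.1). Note that if we turn off the input, that is, we set $\mathbf{x}(n) = \mathbf{0}$, then (9.5.30) becomes
$\mathbf{P}(n) = \lambda^{-1}\mathbf{P}(n-1)$, which is an unstable recursion when $\lambda < 1$.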
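One way to see where the extra term in the modified cost function comes from is to unroll the time updating (9.5.7) from the initial values. The following is a sketch under the assumption that the recursion starts from $\hat{\mathbf{R}}(-1) = \delta\mathbf{I}$ and $\hat{\mathbf{d}}(-1) = \mathbf{0}$ (which is what $\mathbf{P}(-1) = \delta^{-1}\mathbf{I}$ and $\mathbf{c}(-1) = \mathbf{0}$ imply); the sums are written out explicitly to avoid any ambiguity about what the initial term contributes.

```latex
\begin{aligned}
\lambda^{n+1}\hat{\mathbf{R}}(-1) + \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)\mathbf{x}^H(j)
  &= \delta\lambda^{n+1}\mathbf{I} + \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)\mathbf{x}^H(j), \\
\lambda^{n+1}\hat{\mathbf{d}}(-1) + \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)y^*(j)
  &= \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)y^*(j), \\
\nabla_{\mathbf{c}^*}\Big[\delta\lambda^{n+1}\|\mathbf{c}\|^2
  + \sum_{j=0}^{n}\lambda^{n-j}|y(j)-\mathbf{c}^H\mathbf{x}(j)|^2\Big]
  &= \delta\lambda^{n+1}\mathbf{c}
  - \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)\big[y^*(j)-\mathbf{x}^H(j)\mathbf{c}\big] = \mathbf{0} \\
\Longrightarrow\quad
\Big[\delta\lambda^{n+1}\mathbf{I} + \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)\mathbf{x}^H(j)\Big]\mathbf{c}
  &= \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)y^*(j).
\end{aligned}
```

The last line is exactly the set of equations that the recursion solves at time $n$, so the regularization weight $\delta\lambda^{n+1}$ fades at the same exponential rate as the oldest data.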


Finite-word-length effects. There are different RLS algorithms that are algebraically equivalent; that is, they 
solve the same set of normal equations. Therefore, they have the same rate of convergence and the same insensitivity 
to variations in the eigenvalue spread of the input correlation matrix with the CRLS algorithm. All RLS algorithms 
are obtained by exploiting exact mathematical relations between various algorithmic quantities to obtain better 
computational or numerical properties. Many of these algorithmic quantities have certain physical meanings or 
theoretical properties. For example, in the CRLS algorithm, the matrix P(n) is Hermitian and positive definite, the 
angle variable satisfies $0 < \bar{\alpha}(n) \le 1$, and the energy $E(n)$ should always be positive. However, when we use finite
precision, some of these exact relations, properties, or acceptable ranges for certain algorithmic variables may be
violated.
The numerical instability of RLS algorithms can be traced to such forms of numerical inconsistencies
(Verhaegen 1989; Yang and Böhme 1992; Haykin 1996). The crucial part of the CRLS algorithm is the updating of
the inverse correlation matrix $\mathbf{P}(n)$ via (9.5.30). The CRLS algorithm becomes numerically unstable when the
matrix $\mathbf{P}(n) = \hat{\mathbf{R}}^{-1}(n)$ loses its Hermitian symmetry or its positive definiteness (Verhaegen 1989). In practice, we
can preserve the Hermitian symmetry of $\mathbf{P}(n)$ by computing only its lower (or upper) triangular part, using (9.5.30),
and then filling the other part, using the relation $p_{ij}(n) = p_{ji}^*(n)$. Another approach is to replace $\mathbf{P}(n)$ by
$[\mathbf{P}(n) + \mathbf{P}^H(n)]/2$ after updating from $\mathbf{P}(n-1)$ to $\mathbf{P}(n)$.
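The two symmetry-preserving options just described are short enough to show side by side. This is a small NumPy sketch; the helper names `symmetrize_by_average` and `symmetrize_from_lower` and the perturbed test matrix are made up for illustration.

```python
import numpy as np

def symmetrize_by_average(P):
    """Return [P + P^H] / 2."""
    return 0.5 * (P + P.conj().T)

def symmetrize_from_lower(P):
    """Keep the strictly lower triangle of P and mirror it: p_ij = conj(p_ji)."""
    lower = np.tril(P, k=-1)
    diag = np.diag(np.real(np.diag(P)))   # a Hermitian matrix has a real diagonal
    return lower + diag + lower.conj().T

# Illustrative check on a Hermitian matrix with a small asymmetric perturbation.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
P = A @ A.conj().T + 1e-9 * rng.standard_normal((4, 4))
P_avg = symmetrize_by_average(P)
P_low = symmetrize_from_lower(P)
```

Averaging treats both triangles equally, whereas mirroring one triangle corresponds to never computing the other half in the first place, which is what saves operations in the update of P(n).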

It has been shown that the CRLS algorithm is numerically stable for $\lambda < 1$ and diverges for $\lambda = 1$ (Ljung and
Ljung 1985).


9.5.4 Convergence and Performance Analysis 


The purpose of any LS adaptive filter, in a stationary SOE, is to identify the optimum filter $\mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$ from
observations of the input vector $\mathbf{x}(n)$ and the desired response


$y(n) = \mathbf{c}_o^H\mathbf{x}(n) + e_o(n)$ (9.5.37)


To simplify the analysis we adopt the independence assumptions discussed in Section 9.4.2. The results of the 
subsequent analysis hold for any LS adaptive filter implemented using the CRLS method or any other algebraically 
equivalent algorithm. We derive separate results for the growing memory and the fading memory (exponential 
forgetting) algorithms. 


Growing memory ($\lambda = 1$)
In this case all the values of the error signal, from the time the filter starts its operation to the present, have the 


same influence on the cost function. As a result, the filter loses its tracking ability, which is not important if the filter 
is used in a stationary SOE. 


Convergence in the mean. For $n > M$ the coefficient vector $\mathbf{c}(n)$ is identical to the block LS solution
discussed in Section 7.2.2. Therefore

$E\{\mathbf{c}(n)\} = \mathbf{c}_o$ for $n > M$ (9.5.38)

that is, the RLS algorithm converges in the mean for $n > M$, where $M$ is the number of coefficients.

Mean square deviation. For $n > M$ we have

$\boldsymbol{\Phi}(n) = \sigma_o^2 E\{\hat{\mathbf{R}}^{-1}(n)\}$ (9.5.39)

because $\mathbf{c}(n)$ is an exact LS estimate (see Section 7.2.2). The correlation matrix $\hat{\mathbf{R}}(n)$ is described by a complex
Wishart distribution, and the expectation of its inverse is

$E\{\hat{\mathbf{R}}^{-1}(n)\} = \dfrac{1}{n-M}\mathbf{R}^{-1}$, $n > M$ (9.5.40)

as shown in Muirhead (1982) and Haykin (1996). Hence

$\boldsymbol{\Phi}(n) = \dfrac{\sigma_o^2}{n-M}\mathbf{R}^{-1}$, $n > M$ (9.5.41)

and the MSD is

$D(n) = \mathrm{tr}[\boldsymbol{\Phi}(n)] = \dfrac{\sigma_o^2}{n-M}\sum_{i=1}^{M}\dfrac{1}{\lambda_i}$, $n > M$ (9.5.42)

where $\lambda_i$, the eigenvalues of $\mathbf{R}$, should not be confused with the forgetting factor $\lambda$. From (9.5.42) we conclude
that (1) the MSD is magnified by the smallest eigenvalue of $\mathbf{R}$ and (2) the MSD decays as $1/(n-M)$, that is, inversely with time.


A priori excess MSE. We now focus on the a priori LS algorithm because it is widely used in practice and because
it allows a fairer comparison with the (a priori) LMS algorithm. To this end, we note that the a priori excess MSE
formula (9.4.48)

$P_{ex}(n) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)]$ (9.5.43)

derived in Section 9.4.2, under the independence assumption, holds for any a priori adaptive algorithm. Hence,
substituting (9.5.41) into (9.5.43), we obtain

$P_{ex}(n) = \dfrac{M\sigma_o^2}{n-M-1}$, $n > M+1$ (9.5.44)

which shows that $P_{ex}(n)$ tends to zero as $n \to \infty$.


Exponentially decaying memory ($0 < \lambda < 1$)

In this case the most recent values of the observations have greater influence on the formation of the LS estimate
of the filter coefficients. The memory of the filter, that is, the effective number of samples used to form the various
estimates, is about $1/(1-\lambda)$ for $0.95 < \lambda < 1$ (see Section 9.8).


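Convergence in the mean. We start by multiplying both sides of (9.5.11) by $\hat{\mathbf{R}}(n)$, and then we use (9.5.7) and
(9.5.10) to obtain

$\hat{\mathbf{R}}(n)\mathbf{c}(n) = \lambda\hat{\mathbf{R}}(n-1)\mathbf{c}(n-1) + \mathbf{x}(n)y^*(n)$ (9.5.45)

If we multiply (9.5.7) by $\mathbf{c}_o$ and subtract the resulting equation from (9.5.45), we get

$\hat{\mathbf{R}}(n)\tilde{\mathbf{c}}(n) = \lambda\hat{\mathbf{R}}(n-1)\tilde{\mathbf{c}}(n-1) + \mathbf{x}(n)e_o^*(n)$ (9.5.46)

where $\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o$ is the coefficient error vector. Solving (9.5.46) by recursion, we obtain

$\tilde{\mathbf{c}}(n) = \lambda^n\hat{\mathbf{R}}^{-1}(n)\hat{\mathbf{R}}(0)\tilde{\mathbf{c}}(0) + \hat{\mathbf{R}}^{-1}(n)\sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}(j)e_o^*(j)$ (9.5.47)

which depends on the initial conditions and the optimum error $e_o(n)$. If we assume that $\hat{\mathbf{R}}(n)$, $\mathbf{x}(j)$, and $e_o(j)$
are independent and we take the expectation of (9.5.47), we obtain

$E\{\tilde{\mathbf{c}}(n)\} = \delta\lambda^n E\{\hat{\mathbf{R}}^{-1}(n)\}\tilde{\mathbf{c}}(0)$ (9.5.48)

where, as usual, we have set $\hat{\mathbf{R}}(0) = \delta\mathbf{I}$, $\delta > 0$. If the matrix $\hat{\mathbf{R}}(n)$ is positive definite and
$0 < \lambda < 1$, then the mean vector $E\{\tilde{\mathbf{c}}(n)\} \to \mathbf{0}$ as $n \to \infty$. Hence, the RLS algorithm with exponential forgetting
converges asymptotically in the mean to the optimum filter.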


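Mean square deviation. Using (9.5.46), we obtain the following difference equation for the coefficient error
vector

$\tilde{\mathbf{c}}(n) = \lambda\hat{\mathbf{R}}^{-1}(n)\hat{\mathbf{R}}(n-1)\tilde{\mathbf{c}}(n-1) + \hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)e_o^*(n)$

or $\tilde{\mathbf{c}}(n) \simeq \lambda\tilde{\mathbf{c}}(n-1) + \hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)e_o^*(n)$

because $\hat{\mathbf{R}}^{-1}(n)\hat{\mathbf{R}}(n-1) \simeq \mathbf{I}$ for large $n$. If we neglect the dependence among $\tilde{\mathbf{c}}(n-1)$, $\hat{\mathbf{R}}(n)$, $\mathbf{x}(n)$, and
$e_o(n)$, we have

$\boldsymbol{\Phi}(n) \simeq \lambda^2\boldsymbol{\Phi}(n-1) + \sigma_o^2 E\{\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)\mathbf{x}^H(n)\hat{\mathbf{R}}^{-1}(n)\}$ (9.5.49)

where $\sigma_o^2 = E\{|e_o(n)|^2\}$.

To make the analysis mathematically tractable, we need an approximation for the inverse matrix $\hat{\mathbf{R}}^{-1}(n)$. To
this end, using (9.5.3), we have

$E\{\hat{\mathbf{R}}(n)\} = \sum_{j=0}^{n}\lambda^{n-j}E\{\mathbf{x}(j)\mathbf{x}^H(j)\} = \dfrac{1-\lambda^{n+1}}{1-\lambda}\mathbf{R} \simeq \dfrac{1}{1-\lambda}\mathbf{R}$ (9.5.50)

where the last approximation holds for $n \gg 1$. If we use the approximation $E\{\hat{\mathbf{R}}(n)\} \simeq \hat{\mathbf{R}}(n)$, we obtain

$\hat{\mathbf{R}}^{-1}(n) \simeq (1-\lambda)\mathbf{R}^{-1}$ (9.5.51)

which is more rigorously justified in Eleftheriou and Falconer (1986). Using the last approximation, (9.5.49) becomes

$\boldsymbol{\Phi}(n) \simeq \lambda^2\boldsymbol{\Phi}(n-1) + (1-\lambda)^2\sigma_o^2\mathbf{R}^{-1}$ (9.5.52)

which converges because $\lambda^2 < 1$. At steady state we have

$(1-\lambda^2)\boldsymbol{\Phi}(\infty) \simeq (1-\lambda)^2\sigma_o^2\mathbf{R}^{-1}$

because $\boldsymbol{\Phi}(n) \simeq \boldsymbol{\Phi}(n-1)$ for $n \gg 1$. Hence

$\boldsymbol{\Phi}(\infty) \simeq \dfrac{1-\lambda}{1+\lambda}\sigma_o^2\mathbf{R}^{-1}$ (9.5.53)

and therefore

$D(\infty) = \mathrm{tr}[\boldsymbol{\Phi}(\infty)] \simeq \dfrac{1-\lambda}{1+\lambda}\sigma_o^2\sum_{i=1}^{M}\dfrac{1}{\lambda_i}$ (9.5.54)

which in contrast to (9.5.42) does not converge to zero as $n \to \infty$. This is explained by noticing that when $\lambda < 1$,
the RLS algorithm has finite memory and does not use effectively all the data to form its estimate.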


Steady-state a priori excess MSE. From (9.5.43) and (9.5.53) we obtain

$P_{ex}(\infty) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(\infty)] \simeq \dfrac{1-\lambda}{1+\lambda}M\sigma_o^2$ (9.5.55)

which shows that as a result of finite memory, there is a steady-state excess MSE that decreases as $\lambda$ approaches 1,
that is, as the effective memory of the algorithm increases.
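To get a feel for the numbers, the closed-form results for the effective memory and the a priori excess MSE can be evaluated directly. The short Python sketch below does this; the function names and the sample values of M and $\sigma_o^2$ are hypothetical and are not taken from the text.

```python
def rls_effective_memory(lam):
    """Approximate memory 1/(1 - lambda); the text quotes it for 0.95 < lambda < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return 1.0 / (1.0 - lam)

def rls_excess_mse_growing(n, M, sigma_o2):
    """A priori excess MSE M*sigma_o^2/(n - M - 1) for lambda = 1 (growing memory)."""
    if n <= M + 1:
        raise ValueError("formula requires n > M + 1")
    return M * sigma_o2 / (n - M - 1)

def rls_excess_mse_steady_state(lam, M, sigma_o2):
    """Steady-state a priori excess MSE (1 - lam)/(1 + lam) * M * sigma_o^2."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return (1.0 - lam) / (1.0 + lam) * M * sigma_o2

# Hypothetical setting: M = 11 coefficients and sigma_o^2 = 1e-3.
for lam in (0.8, 0.95, 0.99):
    print(lam, rls_effective_memory(lam), rls_excess_mse_steady_state(lam, 11, 1e-3))
```

Moving $\lambda$ from 0.8 to 0.99 lengthens the memory from about 5 to about 100 samples and shrinks the factor $(1-\lambda)/(1+\lambda)$ from roughly 0.11 to roughly 0.005.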


Summary 

The results of the above analysis are summarized in Table 9.7 for easy reference. We stress at this point that all 
RLS algorithms, independent of their implementation, have the same performance, assuming that we use sufficient 
numerical precision (e.g., double-precision floating-point arithmetic). Sometimes, RLS algorithms are said to have 
optimum learning because at every time instant they minimize the weighted error energy from the start of the 
operation (Tsypkin 1973). These properties are illustrated in the following example. 


TABLE 9.7
Summary of RLS and LMS performance in a stationary SOE.

Property | Growing memory RLS algorithm | Exponential memory RLS algorithm | LMS algorithm
Convergence in the mean | For all $n > M$ | Asymptotically for $n \to \infty$ | Asymptotically for $n \to \infty$
Convergence in MS | Independent of the eigenvalue spread | Independent of the eigenvalue spread | Depends on the eigenvalue spread
Excess MSE | $P_{ex}(n) = \dfrac{M\sigma_o^2}{n-M-1} \to 0$ | $P_{ex}(\infty) \simeq \dfrac{1-\lambda}{1+\lambda}M\sigma_o^2$ | $P_{ex}(\infty) \simeq \mu\sigma_o^2\,\mathrm{tr}\,\mathbf{R}$








EXAMPLE 9.5.2. Consider the adaptive equalizer of Example 9.4.3 shown in block diagram form in Figure 9.26. In this
example, we replace the LMS block in Figure 9.26 by the RLS block, and we study the performance of the RLS algorithm and
compare it with that of the LMS algorithm. The input data source is a Bernoulli sequence $\{y(n)\}$ with symbols +1 and −1 having
zero mean and unit variance. The channel impulse response is a raised cosine


$h(n) = \begin{cases} 0.5\left[1 + \cos\left(\dfrac{2\pi}{W}(n-2)\right)\right] & n = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases}$ (9.5.56)


where the parameter $W$ controls the amount of channel distortion [or the eigenvalue spread $\chi(\mathbf{R})$ produced by the channel]. The
channel noise sequence $v(n)$ is white Gaussian with $\sigma_v^2 = 0.001$. The adaptive equalizer has $M = 11$ coefficients, and the
input signal $y(n)$ is delayed by $\Delta = 7$ samples. The error signal $e(n) = y(n-\Delta) - \hat{y}(n)$ is used along with $\mathbf{x}(n)$ to
implement the RLS algorithm given in Table 9.6 with $\mathbf{c}(0) = \mathbf{0}$ and $\delta = 0.001$. We performed Monte Carlo simulations on 100
realizations of random sequences with $W = 2.9$ and $W = 3.5$, and $\lambda = 1$ and 0.8. The results are shown in Figures 9.34 and
9.35.
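The channel model is easy to tabulate. The following plain-Python sketch evaluates the raised-cosine impulse response for the two values of W used in the example; the function name `raised_cosine_channel` is an illustrative choice.

```python
import math

def raised_cosine_channel(W):
    """Return [h(1), h(2), h(3)] with h(n) = 0.5 * [1 + cos(2*pi*(n - 2)/W)]."""
    return [0.5 * (1.0 + math.cos(2.0 * math.pi * (n - 2) / W)) for n in (1, 2, 3)]

for W in (2.9, 3.5):
    h = raised_cosine_channel(W)
    print(W, [round(v, 4) for v in h])
```

The center tap h(2) equals 1 for any W, and the two outer taps are equal and grow with W, which is why a larger W corresponds to more intersymbol interference.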


Effect of eigenvalue spread. Performance plots of the RLS algorithm for $W = 2.9$ and $W = 3.5$ are shown in Figure
9.34. In plot (a) we depict MSE learning curves along with the steady-state (or minimum) error. We observe that the MSE
convergence rate of the RLS, unlike that of the LMS, does not change with $W$ [or equivalently with a change in $\chi(\mathbf{R})$]. The
steady-state error, on the other hand, increases with $W$. The important difference between the two algorithms is that the
convergence rate is faster for the RLS (compare Figures 9.34 and 9.27). This faster convergence of the RLS algorithm is
achieved at the cost of increased computational complexity. In plots (b) and (c) we show the ensemble-averaged equalizer coefficients.
The responses are symmetric with respect to $n = 5$, as assumed. Also, the equalizer coefficients converge to different inverses
because of the changes in the channel characteristics.

FIGURE 9.34
Performance analysis curves of the RLS algorithm in the adaptive equalizer: $\lambda = 1$.


Effect of forgetting factor $\lambda$. In Figure 9.35 we show the MSE learning curves obtained for $W = 2.9$ with two
different forgetting factors, $\lambda = 1$ and $\lambda = 0.8$. For $\lambda = 1$, as explained before, the algorithm has infinite memory and hence the steady-state
excess MSE is zero. This fact can be verified in the plot for $\lambda = 1$, in which the MSE converges to the minimum error. For
$\lambda = 0.8$, the effective memory is $1/(1-\lambda) = 5$, which clearly is inadequate for the accurate estimation of the required statistics.
Therefore, the algorithm should produce a nonzero excess MSE, which can be observed in
the plot for $\lambda = 0.8$.

There are two practical issues regarding the RLS algorithm that need an explanation. The first issue relates to the practical
value of $\lambda$. Although $\lambda$ can take any value in the interval $0 < \lambda \le 1$, it influences the effective memory size, so the value of
$\lambda$ should be close to 1. This value is determined by the number of parameters to be estimated and the desired size of the effective
memory. Typical values used are between 0.99 and 1 (not 0.8, as we used in this example for demonstration). The second issue
deals with the actual computation of matrix $\mathbf{P}(n)$. This matrix must be conjugate symmetric and positive definite. However, an
implementation of the CRLS algorithm of Table 9.6 on a finite-precision processor will eventually disturb this symmetry and
positive definiteness and would result in an unstable performance. Therefore, it is necessary to force this symmetry either by
computing only its lower (or upper) triangular values or by using $\mathbf{P}(n) \leftarrow [\mathbf{P}(n) + \mathbf{P}^H(n)]/2$. Failure to do so generally affects
the algorithm performance for $\lambda < 1$.


FIGURE 9.35
MSE learning curves of the RLS algorithm in the adaptive equalizer: $W = 2.9$.


9.6 Fast RLS Algorithms for FIR Filtering 


In Section 6.3 we exploited the shift invariance of the input data vector 


$\mathbf{x}_{m+1}(n) = \begin{bmatrix}\mathbf{x}_m(n) \\ x(n-m)\end{bmatrix} = \begin{bmatrix} x(n) \\ \mathbf{x}_m(n-1)\end{bmatrix}$ (9.6.1)


to develop a lattice-ladder structure for optimum FIR filters and predictors. The determination of the optimum
parameters (see Figure 6.3) required the $\mathbf{LDL}^H$ decomposition of the correlation matrix $\mathbf{R}(n)$ and the solution of
three triangular systems at each time $n$. However, for stationary signals the optimum filter is time-invariant, and the
coefficients of its direct or lattice-ladder implementation structure are evaluated only once, using the algorithm of 
Levinson. 


The key for the development of order-recursive algorithms was the following order partitioning of the 
correlation matrix 


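$\mathbf{R}_{m+1}(n) = \begin{bmatrix}\mathbf{R}_m(n) & \mathbf{r}_m^b(n) \\ \mathbf{r}_m^{bH}(n) & P_x(n-m)\end{bmatrix} = \begin{bmatrix} P_x(n) & \mathbf{r}_m^{fH}(n) \\ \mathbf{r}_m^f(n) & \mathbf{R}_m(n-1)\end{bmatrix}$ (9.6.2)


which is a result of the shift-invariance property (9.6.1). The same partitioning can be obtained for the LS correlation
matrix $\hat{\mathbf{R}}_{m+1}(n)$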


$\hat{\mathbf{R}}_{m+1}(n) = \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}_{m+1}(j)\mathbf{x}_{m+1}^H(j) = \begin{bmatrix}\hat{\mathbf{R}}_m(n) & \hat{\mathbf{r}}_m^b(n) \\ \hat{\mathbf{r}}_m^{bH}(n) & E_x(n-m)\end{bmatrix} = \begin{bmatrix} E_x(n) & \hat{\mathbf{r}}_m^{fH}(n) \\ \hat{\mathbf{r}}_m^f(n) & \hat{\mathbf{R}}_m(n-1)\end{bmatrix}$ (9.6.3)


if we assume that $\mathbf{x}_m(-1) = \mathbf{0}$, a condition known as prewindowing (see Section 7.3). This condition is necessary to
ensure the presence of the term $\hat{\mathbf{R}}_m(n-1)$ in the lower right corner partitioning of $\hat{\mathbf{R}}_{m+1}(n)$.

The identical forms of (9.6.2) and (9.6.3) imply that the order-recursive relations and the lattice-ladder structure
developed in Section 6.3 for optimum FIR filters can be used for prewindowed LS FIR filters. Simply, the
expectation operator $E\{(\cdot)\}$ should be replaced by the time-averaging operator $\sum_{j=0}^{n}\lambda^{n-j}(\cdot)$, and the term power
should be replaced by the term energy, when we go from the optimum MSE to the LSE formulation.

In this section we exploit the shift invariance (9.6.1) and the time updating


$\hat{\mathbf{R}}_m(n) = \lambda\hat{\mathbf{R}}_m(n-1) + \mathbf{x}_m(n)\mathbf{x}_m^H(n)$ (9.6.4)


to develop the following types of fast algorithms with O(M) complexity: 

1. Fast fixed-order algorithms for RLS direct-form FIR filters by explicitly updating the gain vectors $\mathbf{g}(n)$ and
$\bar{\mathbf{g}}(n)$.

2. Fast order-recursive algorithms for RLS FIR lattice-ladder filters by indirect or direct updating of their
coefficients.

3. QR decomposition-based RLS lattice-ladder algorithms using the Givens rotation.


All relationships in Section 6.3 are valid for the prewindowed LS problem, but we replace $P$ by $E$ to emphasize the
energy interpretation of the cost function. The quantities appearing in the partitionings given by (9.6.3) specify a
prewindowed LS forward linear predictor (FLP) $\mathbf{a}_m(n)$ and an LS backward linear predictor (BLP) $\mathbf{b}_m(n)$. Table 9.8 shows the
correspondences between general FIR filtering, FLP, and BLP. Using these correspondences and the normal
equations for LS filtering, we can easily obtain the normal equations and the total LSE for the FLP and the BLP,
which are also summarized in Table 9.8 (see Problem 9.28). We stress that the predictor parameters $\mathbf{a}_m(n)$ and
$\mathbf{b}_m(n)$ are held fixed over the optimization interval $0 \le j \le n$.


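TABLE 9.8

 | FIR filter | FLP | BLP
Cost function | $E_m(n) = \sum_{j=0}^{n}\lambda^{n-j}|\varepsilon_m(j)|^2$ | $E_m^f(n) = \sum_{j=0}^{n}\lambda^{n-j}|\varepsilon_m^f(j)|^2$ | $E_m^b(n) = \sum_{j=0}^{n}\lambda^{n-j}|\varepsilon_m^b(j)|^2$
Normal equations | $\hat{\mathbf{R}}_m(n)\mathbf{c}_m(n) = \hat{\mathbf{d}}_m(n)$ | $\hat{\mathbf{R}}_m(n-1)\mathbf{a}_m(n) = -\hat{\mathbf{r}}_m^f(n)$ | $\hat{\mathbf{R}}_m(n)\mathbf{b}_m(n) = -\hat{\mathbf{r}}_m^b(n)$
LSE | $E_m(n) = E_y(n) - \mathbf{c}_m^H(n)\hat{\mathbf{d}}_m(n)$ | $E_m^f(n) = E_x(n) + \mathbf{a}_m^H(n)\hat{\mathbf{r}}_m^f(n)$ | $E_m^b(n) = E_x(n-m) + \mathbf{b}_m^H(n)\hat{\mathbf{r}}_m^b(n)$
Cross-correlation vectors | $\hat{\mathbf{d}}_m(n) = \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}_m(j)y^*(j)$ | $\hat{\mathbf{r}}_m^f(n) = \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}_m(j-1)x^*(j)$ | $\hat{\mathbf{r}}_m^b(n) = \sum_{j=0}^{n}\lambda^{n-j}\mathbf{x}_m(j)x^*(j-m)$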





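Table 9.9 summarizes the a priori and a posteriori time updates for the LS FIR filter derived in Section 9.5. If we
use the correspondences between general FIR filtering and linear prediction, we can easily deduce similar
time-updating recursions for the FLP and the BLP. These updates, which are also discussed in Problem 9.29, are
summarized in Table 9.9.

TABLE 9.9
Summary of LS time-updating relations using a priori and a posteriori errors.

Equation | A priori time updating | A posteriori time updating
Gain
(a) | $\hat{\mathbf{R}}_m(n)\mathbf{g}_m(n) = \mathbf{x}_m(n)$ | $\lambda\hat{\mathbf{R}}_m(n-1)\bar{\mathbf{g}}_m(n) = \mathbf{x}_m(n)$
FIR filter
(b) | $e_m(n) = y(n) - \mathbf{c}_m^H(n-1)\mathbf{x}_m(n)$ | $\varepsilon_m(n) = y(n) - \mathbf{c}_m^H(n)\mathbf{x}_m(n)$
(c) | $\mathbf{c}_m(n) = \mathbf{c}_m(n-1) + \mathbf{g}_m(n)e_m^*(n)$ | $\mathbf{c}_m(n) = \mathbf{c}_m(n-1) + \bar{\mathbf{g}}_m(n)\varepsilon_m^*(n)$
(d) | $E_m(n) = \lambda E_m(n-1) + \bar{\alpha}_m(n)|e_m(n)|^2$ | $E_m(n) = \lambda E_m(n-1) + \dfrac{|\varepsilon_m(n)|^2}{\bar{\alpha}_m(n)}$
FLP
(e) | $e_m^f(n) = x(n) + \mathbf{a}_m^H(n-1)\mathbf{x}_m(n-1)$ | $\varepsilon_m^f(n) = x(n) + \mathbf{a}_m^H(n)\mathbf{x}_m(n-1)$
(f) | $\mathbf{a}_m(n) = \mathbf{a}_m(n-1) - \mathbf{g}_m(n-1)e_m^{f*}(n)$ | $\mathbf{a}_m(n) = \mathbf{a}_m(n-1) - \bar{\mathbf{g}}_m(n-1)\varepsilon_m^{f*}(n)$
(g) | $E_m^f(n) = \lambda E_m^f(n-1) + \bar{\alpha}_m(n-1)|e_m^f(n)|^2$ | $E_m^f(n) = \lambda E_m^f(n-1) + \dfrac{|\varepsilon_m^f(n)|^2}{\bar{\alpha}_m(n-1)}$
BLP
(h) | $e_m^b(n) = x(n-m) + \mathbf{b}_m^H(n-1)\mathbf{x}_m(n)$ | $\varepsilon_m^b(n) = x(n-m) + \mathbf{b}_m^H(n)\mathbf{x}_m(n)$
(i) | $\mathbf{b}_m(n) = \mathbf{b}_m(n-1) - \mathbf{g}_m(n)e_m^{b*}(n)$ | $\mathbf{b}_m(n) = \mathbf{b}_m(n-1) - \bar{\mathbf{g}}_m(n)\varepsilon_m^{b*}(n)$
(j) | $E_m^b(n) = \lambda E_m^b(n-1) + \bar{\alpha}_m(n)|e_m^b(n)|^2$ | $E_m^b(n) = \lambda E_m^b(n-1) + \dfrac{|\varepsilon_m^b(n)|^2}{\bar{\alpha}_m(n)}$


9.6.1 Fast Fixed-Order RLS FIR Filters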












The major computational task in RLS filters is the computation of the gain vector $\mathbf{g}(n)$ or $\bar{\mathbf{g}}(n)$. The CRLS
algorithm updates the inverse matrix $\hat{\mathbf{R}}^{-1}(n)$ and then determines the gain vector via a matrix-by-vector
multiplication that results in $O(M^2)$ complexity. The only way to reduce the complexity from $O(M^2)$ to $O(M)$
is by directly updating the gain vectors. We next show how to develop such algorithms by exploiting the
shift-invariant structure of the input data vector shown in (9.6.1).

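Fast Kalman algorithm: Updating the gain $\mathbf{g}(n)$

Suppose that we know the gain

$\mathbf{g}_m(n-1) = \hat{\mathbf{R}}_m^{-1}(n-1)\mathbf{x}_m(n-1)$ (9.6.5)

and we wish to compute the gain

$\mathbf{g}_m(n) = \hat{\mathbf{R}}_m^{-1}(n)\mathbf{x}_m(n)$ (9.6.6)

at the next time instant by "adjusting" $\mathbf{g}_m(n-1)$, using the new data $\{\mathbf{x}_m(n), y(n)\}$.
If we use the matrix inversion by partitioning formulas (6.1.24) and (6.1.26) for matrix
$\hat{\mathbf{R}}_{m+1}(n)$, we have

$\hat{\mathbf{R}}_{m+1}^{-1}(n) = \begin{bmatrix}\hat{\mathbf{R}}_m^{-1}(n) & \mathbf{0}_m \\ \mathbf{0}_m^H & 0\end{bmatrix} + \dfrac{1}{E_m^b(n)}\begin{bmatrix}\mathbf{b}_m(n) \\ 1\end{bmatrix}\begin{bmatrix}\mathbf{b}_m^H(n) & 1\end{bmatrix}$ (9.6.7)

and

$\hat{\mathbf{R}}_{m+1}^{-1}(n) = \begin{bmatrix} 0 & \mathbf{0}_m^H \\ \mathbf{0}_m & \hat{\mathbf{R}}_m^{-1}(n-1)\end{bmatrix} + \dfrac{1}{E_m^f(n)}\begin{bmatrix} 1 \\ \mathbf{a}_m(n)\end{bmatrix}\begin{bmatrix} 1 & \mathbf{a}_m^H(n)\end{bmatrix}$ (9.6.8)

as was shown in Section 6.1.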
Using (9.6.7), the first partitioning in (9.6.1), and the definition of $\varepsilon_m^b(n)$ from Table 9.9, we obtain

$\mathbf{g}_{m+1}(n) = \begin{bmatrix}\mathbf{g}_m(n) \\ 0\end{bmatrix} + \dfrac{\varepsilon_m^b(n)}{E_m^b(n)}\begin{bmatrix}\mathbf{b}_m(n) \\ 1\end{bmatrix}$ (9.6.9)

which provides a pure order update of the gain vector $\mathbf{g}_m(n)$. Similarly, using (9.6.8), the second partitioning in
(9.6.1), and the definition of $\varepsilon_m^f(n)$ from Table 9.9, we have

$\mathbf{g}_{m+1}(n) = \begin{bmatrix} 0 \\ \mathbf{g}_m(n-1)\end{bmatrix} + \dfrac{\varepsilon_m^f(n)}{E_m^f(n)}\begin{bmatrix} 1 \\ \mathbf{a}_m(n)\end{bmatrix}$ (9.6.10)

which provides a combined order and time update of the gain vector $\mathbf{g}_m(n)$. This is the key to the development of
fast algorithms for updating the gain vector.

Given the gain $\mathbf{g}_m(n-1)$, first we compute $\mathbf{g}_{m+1}(n)$, using (9.6.10). Then we compute $\mathbf{g}_m(n)$ from the first
$m$ equations of (9.6.9) as

$\mathbf{g}_m(n) = \mathbf{g}_{m+1}^{\lceil m\rceil}(n) - g_{m+1}^{(m+1)}(n)\mathbf{b}_m(n)$ (9.6.11)

because

$g_{m+1}^{(m+1)}(n) = \dfrac{\varepsilon_m^b(n)}{E_m^b(n)}$ (9.6.12)

from the last equation in (9.6.9). Here $\mathbf{g}_{m+1}^{\lceil m\rceil}(n)$ denotes the first $m$ elements of $\mathbf{g}_{m+1}(n)$ and $g_{m+1}^{(m+1)}(n)$ its last element. The updatings (9.6.9) and (9.6.10) require time updatings for the predictors $\mathbf{a}_m(n)$
and $\mathbf{b}_m(n)$ and the minimum error energies $E_m^f(n)$ and $E_m^b(n)$, which are given in Table 9.9. The only
remaining problem is the coupling between $\mathbf{g}_m(n)$ in (9.6.11) and $\mathbf{b}_m(n)$ in

$\mathbf{b}_m(n) = \mathbf{b}_m(n-1) - \mathbf{g}_m(n)e_m^{b*}(n)$ (9.6.13)

which can be avoided by eliminating $\mathbf{b}_m(n)$. Carrying out the elimination, we obtain

$\mathbf{g}_m(n) = \dfrac{\mathbf{g}_{m+1}^{\lceil m\rceil}(n) - g_{m+1}^{(m+1)}(n)\mathbf{b}_m(n-1)}{1 - g_{m+1}^{(m+1)}(n)e_m^{b*}(n)}$ (9.6.14)

which provides the last step required to complete the updating. This approach, which is known as the fast Kalman
algorithm, was developed in Falconer and Ljung (1978) using the ideas introduced by Morf (1974). To emphasize the
fixed-order nature of the algorithm, we set $m = M$ and drop the order subscript for all quantities of order $M$. The
computational organization of the algorithm, which requires $9M$ operations per time updating, is summarized in Table 9.10.
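The gain and predictor steps (a) to (h) of Table 9.10 use vector operations only. A minimal NumPy sketch follows; the function name `fast_kalman_gains`, the argument defaults, and the way the data vector is shifted are illustrative assumptions. It starts from a(−1) = b(−1) = g(−1) = 0 and $E^f(-1) = \delta$, assumes prewindowing, and makes no attempt at numerical stabilization, so it is meant for short runs only.

```python
import numpy as np

def fast_kalman_gains(x, M, lam=0.99, delta=0.01):
    """Return the gains g(n), n = 0..N-1, one per row (steps (a)-(h) of Table 9.10)."""
    N = len(x)
    a = np.zeros(M, dtype=complex)        # forward predictor a(n-1)
    b = np.zeros(M, dtype=complex)        # backward predictor b(n-1)
    g = np.zeros(M, dtype=complex)        # gain g(n-1)
    Ef = delta                            # E^f(-1)
    x_old = np.zeros(M, dtype=complex)    # data vector x(n-1), zero before n = 0
    gains = np.zeros((N, M), dtype=complex)
    for n in range(N):
        x_new = np.concatenate(([x[n]], x_old[:-1]))   # [x(n), ..., x(n-M+1)]
        x_last = x_old[-1]                              # x(n-M)
        ef = x[n] + np.vdot(a, x_old)                   # (a) a priori forward error
        a = a - g * np.conj(ef)                         # (b)
        eps_f = x[n] + np.vdot(a, x_old)                # (c) a posteriori forward error
        Ef = lam * Ef + (eps_f * np.conj(ef)).real      # (d)
        g_ext = np.concatenate(([0], g)) + (eps_f / Ef) * np.concatenate(([1], a))  # (e)
        eb = x_last + np.vdot(b, x_new)                 # (f) a priori backward error
        g_end = g_ext[-1]
        g = (g_ext[:-1] - g_end * b) / (1 - g_end * np.conj(eb))   # (g)
        b = b - g * np.conj(eb)                         # (h)
        gains[n] = g
        x_old = x_new
    return gains

# Illustrative use on a short complex white-noise record (made-up data).
rng = np.random.default_rng(0)
x = rng.standard_normal(100) + 1j * rng.standard_normal(100)
G = fast_kalman_gains(x, M=4, lam=0.99, delta=0.01)
```

Under the prewindowing assumption, these gains should agree up to rounding with the CRLS gains when the CRLS recursion is started from $\mathbf{P}(-1) = \delta^{-1}\,\mathrm{diag}(1, \lambda, \ldots, \lambda^{M-1})$, which reduces to $\delta^{-1}\mathbf{I}$ for $\lambda = 1$; this initial matrix is the one implied by shift invariance, not a value stated in the text.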


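TABLE 9.10
Fast Kalman algorithm for time updating of LS FIR filters.

Equation | Computation
Old estimates: $\mathbf{a}(n-1)$, $\mathbf{b}(n-1)$, $\mathbf{g}(n-1)$, $\mathbf{c}(n-1)$, $E^f(n-1)$
New data: $\{\mathbf{x}(n), y(n)\}$

Gain and predictor update
(a) | $e^f(n) = x(n) + \mathbf{a}^H(n-1)\mathbf{x}(n-1)$
(b) | $\mathbf{a}(n) = \mathbf{a}(n-1) - \mathbf{g}(n-1)e^{f*}(n)$
(c) | $\varepsilon^f(n) = x(n) + \mathbf{a}^H(n)\mathbf{x}(n-1)$
(d) | $E^f(n) = \lambda E^f(n-1) + \varepsilon^f(n)e^{f*}(n)$
(e) | $\mathbf{g}_{M+1}(n) = \begin{bmatrix} 0 \\ \mathbf{g}(n-1)\end{bmatrix} + \dfrac{\varepsilon^f(n)}{E^f(n)}\begin{bmatrix} 1 \\ \mathbf{a}(n)\end{bmatrix}$
(f) | $e^b(n) = x(n-M) + \mathbf{b}^H(n-1)\mathbf{x}(n)$
(g) | $\mathbf{g}(n) = \dfrac{\mathbf{g}_{M+1}^{\lceil M\rceil}(n) - g_{M+1}^{(M+1)}(n)\mathbf{b}(n-1)}{1 - g_{M+1}^{(M+1)}(n)e^{b*}(n)}$
(h) | $\mathbf{b}(n) = \mathbf{b}(n-1) - \mathbf{g}(n)e^{b*}(n)$

Filter update
(i) | $e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n)$
(j) | $\mathbf{c}(n) = \mathbf{c}(n-1) + \mathbf{g}(n)e^*(n)$


The FAEST algorithm: Updating the gain $\bar{\mathbf{g}}(n)$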
In a similar way we can update the gain vector

$\bar{\mathbf{g}}_m(n) = \lambda^{-1}\hat{\mathbf{R}}_m^{-1}(n-1)\mathbf{x}_m(n)$ (9.6.15)

by using (9.6.7) and (9.6.8). Indeed, using (9.6.8) with the lower partitioning in (9.6.1) and (9.6.7) with the upper
partitioning in (9.6.1), we obtain

$\bar{\mathbf{g}}_{m+1}(n) = \begin{bmatrix} 0 \\ \bar{\mathbf{g}}_m(n-1)\end{bmatrix} + \dfrac{e_m^f(n)}{\lambda E_m^f(n-1)}\begin{bmatrix} 1 \\ \mathbf{a}_m(n-1)\end{bmatrix}$ (9.6.16)

and

$\bar{\mathbf{g}}_{m+1}(n) = \begin{bmatrix}\bar{\mathbf{g}}_m(n) \\ 0\end{bmatrix} + \dfrac{e_m^b(n)}{\lambda E_m^b(n-1)}\begin{bmatrix}\mathbf{b}_m(n-1) \\ 1\end{bmatrix}$ (9.6.17)

which provide a link between $\bar{\mathbf{g}}_m(n-1)$ and $\bar{\mathbf{g}}_m(n)$. From (9.6.17) we obtain

$\bar{\mathbf{g}}_m(n) = \bar{\mathbf{g}}_{m+1}^{\lceil m\rceil}(n) - \bar{g}_{m+1}^{(m+1)}(n)\mathbf{b}_m(n-1)$ (9.6.18)

because

$\bar{g}_{m+1}^{(m+1)}(n) = \dfrac{e_m^b(n)}{\lambda E_m^b(n-1)}$ (9.6.19)

from the last row of (9.6.17). The fundamental difference between (9.6.9) and (9.6.17) is that the presence of
$\mathbf{b}_m(n-1)$ in the latter breaks the coupling between gain vector and backward predictor. Furthermore, (9.6.19) can be
used to compute $e_m^b(n)$ by

$e_m^b(n) = \lambda E_m^b(n-1)\,\bar{g}_{m+1}^{(m+1)}(n)$ (9.6.20)

with only two multiplications.

The time updatings of the predictors using the gain $\bar{\mathbf{g}}_m(n)$, which are given in Table 9.9, require the a posteriori
errors that can be computed from the a priori errors by using the conversion factor

$\alpha_m(n) = 1 + \bar{\mathbf{g}}_m^H(n)\mathbf{x}_m(n)$ (9.6.21)

which should be updated in time as well. This can be achieved by a two-step procedure as follows. First, using (9.6.16)
and the lower partitioning in (9.6.1), we obtain

$\alpha_{m+1}(n) = \alpha_m(n-1) + \dfrac{|e_m^f(n)|^2}{\lambda E_m^f(n-1)}$ (9.6.22)

which is a combined time and order updating. Then we use (9.6.17) and the upper partitioning in (9.6.1) to obtain

$\alpha_m(n) = \alpha_{m+1}(n) - \bar{g}_{m+1}^{(m+1)}(n)e_m^{b*}(n)$ (9.6.23)

or

$\alpha_m(n) = \alpha_{m+1}(n) - \dfrac{|e_m^b(n)|^2}{\lambda E_m^b(n-1)}$ (9.6.24)

which in conjunction with (9.6.22) provides the required time update $\alpha_m(n-1) \to \alpha_{m+1}(n) \to \alpha_m(n)$.

This leads to the fast a posteriori error sequential technique (FAEST) algorithm presented in Table 9.11, which
was introduced in Carayannis et al. (1983). The FAEST algorithm requires only $7M$ operations per time update and is
the most efficient known algorithm for prewindowed RLS FIR filters.


Fast transversal filter (FTF) algorithm. This is an a posteriori type of algorithm obtained from the FAEST by
using the conversion factor

$\bar{\alpha}_m(n) = 1 - \mathbf{g}_m^H(n)\mathbf{x}_m(n)$ (9.6.25)

instead of the conversion factor $\alpha_m(n) = 1/\bar{\alpha}_m(n)$. Using the Levinson recursions (9.6.9) and (9.6.10) in
conjunction with the upper and lower partitionings in (9.6.1), we obtain

$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n) - \dfrac{|\varepsilon_m^b(n)|^2}{E_m^b(n)}$ (9.6.26)

and

$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n-1) - \dfrac{|\varepsilon_m^f(n)|^2}{E_m^f(n)}$ (9.6.27)

respectively. To obtain the FTF algorithm, we replace $\alpha_m(n)$ in Table 9.11 by $1/\bar{\alpha}_m(n)$ and Equation (h) by
(9.6.27). To obtain $\bar{\alpha}_m(n)$ from $\bar{\alpha}_{m+1}(n)$, we cannot use (9.6.26) because it requires quantities dependent on
$\bar{\alpha}_m(n)$. To avoid this problem, we replace Equation (i) by the following relation

$\bar{\alpha}_m(n) = \dfrac{\bar{\alpha}_{m+1}(n)}{1 - \bar{\alpha}_{m+1}(n)\,\bar{g}_{m+1}^{(m+1)}(n)e_m^{b*}(n)}$ (9.6.28)

obtained by combining (9.6.24), (9.6.19), and $\bar{\alpha}_m(n) = 1/\alpha_m(n)$. This algorithm, which has the same complexity as
FAEST, was introduced in Cioffi and Kailath (1984) using a geometric derivation, and is known as the fast
transversal filter (FTF) algorithm.


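TABLE 9.11
FAEST algorithm for time updating of LS FIR filters.

Equation | Computation
Old estimates: $\mathbf{a}(n-1)$, $\mathbf{b}(n-1)$, $\mathbf{c}(n-1)$, $\bar{\mathbf{g}}(n-1)$, $E^f(n-1)$, $E^b(n-1)$, $\alpha(n-1)$
New data: $\{\mathbf{x}(n), y(n)\}$

Gain and predictor update
(a) | $e^f(n) = x(n) + \mathbf{a}^H(n-1)\mathbf{x}(n-1)$
(b) | $\varepsilon^f(n) = \dfrac{e^f(n)}{\alpha(n-1)}$
(c) | $\mathbf{a}(n) = \mathbf{a}(n-1) - \bar{\mathbf{g}}(n-1)\varepsilon^{f*}(n)$
(d) | $E^f(n) = \lambda E^f(n-1) + \varepsilon^f(n)e^{f*}(n)$
(e) | $\bar{\mathbf{g}}_{M+1}(n) = \begin{bmatrix} 0 \\ \bar{\mathbf{g}}(n-1)\end{bmatrix} + \dfrac{e^f(n)}{\lambda E^f(n-1)}\begin{bmatrix} 1 \\ \mathbf{a}(n-1)\end{bmatrix}$
(f) | $e^b(n) = \lambda E^b(n-1)\,\bar{g}_{M+1}^{(M+1)}(n)$
(g) | $\bar{\mathbf{g}}(n) = \bar{\mathbf{g}}_{M+1}^{\lceil M\rceil}(n) - \bar{g}_{M+1}^{(M+1)}(n)\mathbf{b}(n-1)$
(h) | $\alpha_{M+1}(n) = \alpha(n-1) + \dfrac{|e^f(n)|^2}{\lambda E^f(n-1)}$
(i) | $\alpha(n) = \alpha_{M+1}(n) - \bar{g}_{M+1}^{(M+1)}(n)e^{b*}(n)$
(j) | $\varepsilon^b(n) = \dfrac{e^b(n)}{\alpha(n)}$
(k) | $\mathbf{b}(n) = \mathbf{b}(n-1) - \bar{\mathbf{g}}(n)\varepsilon^{b*}(n)$
(l) | $E^b(n) = \lambda E^b(n-1) + \varepsilon^b(n)e^{b*}(n)$

Filter update
(m) | $e(n) = y(n) - \mathbf{c}^H(n-1)\mathbf{x}(n)$
(n) | $\varepsilon(n) = \dfrac{e(n)}{\alpha(n)}$
(o) | $\mathbf{c}(n) = \mathbf{c}(n-1) + \bar{\mathbf{g}}(n)\varepsilon^*(n)$

An alternative updating to (9.6.27) can be obtained by noticing that

$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n-1) - \bar{\alpha}_m^2(n-1)\dfrac{|e_m^f(n)|^2}{E_m^f(n)}$
$\qquad\quad\;\; = \dfrac{\bar{\alpha}_m(n-1)}{E_m^f(n)}\left[E_m^f(n) - \bar{\alpha}_m(n-1)|e_m^f(n)|^2\right]$

or equivalently

$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n-1)\dfrac{\lambda E_m^f(n-1)}{E_m^f(n)}$ (9.6.29)

which can be used instead of (9.6.27) in the FTF algorithm. In a similar way, we can show that

$\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n)\dfrac{\lambda E_m^b(n-1)}{E_m^b(n)}$ (9.6.30)

which will be used later.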


Some practical considerations 

Figure 9.36 shows the realization of an adaptive RLS filter using the direct-form structure. The coefficient 
updating can be done using any of the introduced fast RLS algorithms. Some issues related to the implementation of 
these filters using multiprocessing are discussed in Problem 9.48. 












FIGURE 9.36 
Implementation of an adaptive FIR filter using a direct-form structure. 


In practice, the fast direct-form RLS algorithms are initialized at $n = 0$ by setting

$E^f(-1) = E^b(-1) = \delta > 0$
$\alpha(-1) = 1$ or $\bar{\alpha}(-1) = 1$ (9.6.31)

and all other quantities equal to zero. The constant $\delta$ is chosen as a small positive number on the order of $0.01\sigma_x^2$
(Hubing and Alexander 1991). For $\lambda < 1$, the effects of the initial conditions are quickly "forgotten."

Although the fast direct-form RLS algorithms have the lowest computational complexity, they suffer from
numerical instability when $\lambda < 1$ (Ljung and Ljung 1985). When these algorithms are implemented with finite
precision, the exact algebraic relations used for their derivation break down and lead to numerical problems.

There are two ways to deal with stabilization of the fast direct-form RLS algorithms. In the first approach, we try
to identify precursors of ill behavior (warnings) and then use appropriate rescue operations to restore the normal
operation of the algorithm (Lin 1984; Cioffi and Kailath 1984). One widely used rescue variable is

$\eta_m(n) \triangleq \dfrac{\bar{\alpha}_{m+1}(n)}{\bar{\alpha}_m(n)} = \dfrac{\lambda E_m^b(n-1)}{E_m^b(n)}$ (9.6.32)

which satisfies $0 \le \eta_m(n) \le 1$ for infinite-precision arithmetic.

In the second approach, we exploit the fact that certain algorithmic quantities can be computed in two different 
ways. Therefore, we could use their difference, which provides a measure of the numerical errors, to change the 
dynamics of the error propagation system and stabilize the algorithm. For example, both $e_m^b(n)$ and $\alpha_m(n)$ can be
computed either using their definition or simpler order-recursions. This approach has been used to obtain stabilized 
algorithms with complexities 9M and 8M; however, their performance is highly dependent on proper initialization 
(Slock and Kailath 1991, 1993). 


9.6.2 RLS Lattice-Ladder Filters 


The lattice-ladder structure¹ derived in Section 6.3 using the MSE criterion, due to the similarity of (9.6.2) and
(9.6.3), holds for the prewindowed LSE criterion as well. This structure, which is depicted in Figure 9.37 for the a
posteriori error case, is described by the following equations

$\varepsilon_0^f(n) = \varepsilon_0^b(n) = x(n)$
$\varepsilon_{m+1}^f(n) = \varepsilon_m^f(n) + k_m^{f*}(n)\varepsilon_m^b(n-1)$, $0 \le m < M-1$ (9.6.33)
$\varepsilon_{m+1}^b(n) = \varepsilon_m^b(n-1) + k_m^{b*}(n)\varepsilon_m^f(n)$, $0 \le m < M-1$ (9.6.34)

for the lattice part and

$\varepsilon_0(n) = y(n)$
$\varepsilon_{m+1}(n) = \varepsilon_m(n) - k_m^{c*}(n)\varepsilon_m^b(n)$, $0 \le m \le M-1$ (9.6.35)

for the ladder part.

FIGURE 9.37
A posteriori error RLS lattice-ladder filter.

The lattice parameters are given by

$k_m^f(n) = -\dfrac{\beta_m(n)}{E_m^b(n-1)}$ (9.6.36)

and

$k_m^b(n) = -\dfrac{\beta_m^*(n)}{E_m^f(n)}$ (9.6.37)

and the ladder parameters by

$k_m^c(n) = \dfrac{\beta_m^c(n)}{E_m^b(n)}$ (9.6.38)

where

$\beta_m(n) = \mathbf{b}_m^H(n-1)\hat{\mathbf{r}}_m^f(n) + \hat{r}_{m+1}^f(n)$ (9.6.39)

and

$\beta_m^c(n) = \mathbf{b}_m^H(n)\hat{\mathbf{d}}_m(n) + \hat{d}_{m+1}(n)$ (9.6.40)

are the partial correlation parameters.

However, as we recall, the time updating of the minimum LSE energies and the partial correlations is possible
only if there is a time update for the correlation matrix $\hat{\mathbf{R}}_m(n)$ and the cross-correlation vector $\hat{\mathbf{d}}_m(n)$.
The minimum LSE energies can be updated in time using

$E_m^f(n) = \lambda E_m^f(n-1) + \varepsilon_m^f(n)e_m^{f*}(n)$ (9.6.41)
$E_m^b(n) = \lambda E_m^b(n-1) + \varepsilon_m^b(n)e_m^{b*}(n)$ (9.6.42)

or their variations, given in Table 9.9.

¹In Chapter 6 we used the symbol $e(n)$ because we had no need to distinguish between a priori and a posteriori errors. However, since
the error $e(n)$ in Section 6.3 is an a posteriori error, we now use the symbol $\varepsilon(n)$.
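To update the partial correlation $\beta_m(n)$, we start with the definition (9.6.39) and then use the time-updating
formulas for all involved quantities, rearranging and recombining terms as follows:

$\beta_m(n+1) = \mathbf{b}_m^H(n)\hat{\mathbf{r}}_m^f(n+1) + \hat{r}_{m+1}^f(n+1)$
$\quad = \mathbf{b}_m^H(n)[\lambda\hat{\mathbf{r}}_m^f(n) + \mathbf{x}_m(n)x^*(n+1)] + [\lambda\hat{r}_{m+1}^f(n) + x(n-m)x^*(n+1)]$
$\quad = \lambda\mathbf{b}_m^H(n)\hat{\mathbf{r}}_m^f(n) + \varepsilon_m^b(n)x^*(n+1) + \lambda\hat{r}_{m+1}^f(n)$
$\quad = \lambda[\mathbf{b}_m^H(n-1) - \varepsilon_m^b(n)\bar{\mathbf{g}}_m^H(n)]\hat{\mathbf{r}}_m^f(n) + \lambda\hat{r}_{m+1}^f(n) + \varepsilon_m^b(n)x^*(n+1)$
$\quad = \lambda\beta_m(n) + \varepsilon_m^b(n)[x^*(n+1) - \lambda\bar{\mathbf{g}}_m^H(n)\hat{\mathbf{r}}_m^f(n)]$
$\quad = \lambda\beta_m(n) + \varepsilon_m^b(n)[x^*(n+1) - \mathbf{x}_m^H(n)\hat{\mathbf{R}}_m^{-1}(n-1)\hat{\mathbf{r}}_m^f(n)]$
$\quad = \lambda\beta_m(n) + \varepsilon_m^b(n)[x^*(n+1) + \mathbf{x}_m^H(n)\mathbf{a}_m(n)]$
$\quad = \lambda\beta_m(n) + \varepsilon_m^b(n)e_m^{f*}(n+1)$

which provides the desired update formula. The updating

$\beta_m(n) = \lambda\beta_m(n-1) + \varepsilon_m^b(n-1)e_m^{f*}(n)$ (9.6.43)
$\quad = \lambda\beta_m(n-1) + \dfrac{1}{\bar{\alpha}_m(n-1)}\varepsilon_m^b(n-1)\varepsilon_m^{f*}(n)$ (9.6.44)

is feasible because the right-hand side involves already-known quantities.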
In a similar way (see Problem 9.36), we can show that

$\beta_m^c(n) = \lambda\beta_m^c(n-1) + \varepsilon_m^b(n)e_m^*(n)$ (9.6.45)
$\quad = \lambda\beta_m^c(n-1) + \dfrac{1}{\bar{\alpha}_m(n)}\varepsilon_m^b(n)\varepsilon_m^*(n)$ (9.6.46)

which facilitates the updating of the ladder parameters.

To obtain an a posteriori algorithm, we need the conversion factor $\bar{\alpha}_m(n)$, which can be obtained using the
order-recursive formula (9.6.26). A detailed organization of the a posteriori LS lattice-ladder algorithm, which requires
about $20M$ operations per time update, is given in Table 9.12. The initialization of the algorithm is easily obtained from
the definitions of the corresponding quantities. The condition $\bar{\alpha}_0(n-1) = 1$ follows from (9.6.25), and the positive
constant $\delta$ is chosen to ensure the invertibility of the LS correlation matrix $\hat{\mathbf{R}}(n)$ (see Section 9.5). The time-updating
recursions (c) and (d) can be replaced by order recursions, as explained in Problem 9.37.

TABLE 9.12
Computational organization of a posteriori LS lattice-ladder algorithm.

Equation | Computation
Time initialization ($n = 0$)
    $E_m^f(-1) = E_m^b(-1) = \delta > 0$, $0 \le m < M-1$
    $\beta_m(-1) = 0$, $\varepsilon_m^b(-1) = 0$, $0 \le m < M-1$
    $\beta_m^c(-1) = 0$, $0 \le m \le M-1$
Order initialization
(a) | $\varepsilon_0^f(n) = \varepsilon_0^b(n) = x(n)$, $\varepsilon_0(n) = y(n)$, $\bar{\alpha}_0(n-1) = 1$
Lattice part: $m = 0, 1, \ldots, M-1$
(b) | $\beta_m(n) = \lambda\beta_m(n-1) + \dfrac{\varepsilon_m^b(n-1)\varepsilon_m^{f*}(n)}{\bar{\alpha}_m(n-1)}$
(c) | $E_m^f(n) = \lambda E_m^f(n-1) + \dfrac{|\varepsilon_m^f(n)|^2}{\bar{\alpha}_m(n-1)}$
(d) | $E_m^b(n) = \lambda E_m^b(n-1) + \dfrac{|\varepsilon_m^b(n)|^2}{\bar{\alpha}_m(n)}$
(e) | $k_m^f(n) = -\dfrac{\beta_m(n)}{E_m^b(n-1)}$
(f) | $k_m^b(n) = -\dfrac{\beta_m^*(n)}{E_m^f(n)}$
(g) | $\varepsilon_{m+1}^f(n) = \varepsilon_m^f(n) + k_m^{f*}(n)\varepsilon_m^b(n-1)$
(h) | $\varepsilon_{m+1}^b(n) = \varepsilon_m^b(n-1) + k_m^{b*}(n)\varepsilon_m^f(n)$
(i) | $\bar{\alpha}_{m+1}(n) = \bar{\alpha}_m(n) - \dfrac{|\varepsilon_m^b(n)|^2}{E_m^b(n)}$
Ladder part: $m = 1, 2, \ldots, M$
(j) | $\beta_m^c(n) = \lambda\beta_m^c(n-1) + \dfrac{\varepsilon_m^b(n)\varepsilon_m^*(n)}{\bar{\alpha}_m(n)}$
(k) | $k_m^c(n) = \dfrac{\beta_m^c(n)}{E_m^b(n)}$
(l) | $\varepsilon_{m+1}(n) = \varepsilon_m(n) - k_m^{c*}(n)\varepsilon_m^b(n)$

If instead of the a posteriori errors we use the a priori ones, we obtain the following recursions 


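$$e_0^f(n) = e_0^b(n) = x(n)$$
$$e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\,e_m^b(n-1) \qquad 0 \le m < M-1 \tag{9.6.47}$$
$$e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\,e_m^f(n) \qquad 0 \le m < M-1 \tag{9.6.48}$$

for the lattice part and

$$e_0(n) = y(n)$$
$$e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\,e_m^b(n) \qquad 1 \le m \le M \tag{9.6.49}$$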


for the ladder part (see Problem 9.38). As expected, the a priori structure uses the old LS estimates of the 
lattice-ladder parameters. Based on these recursions, we can develop the a priori error RLS lattice-ladder algorithm 
shown in Table 9.13, which requires about 20M operations per time update.








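TABLE 9.13
Computational organization of a priori LS lattice-ladder algorithm.

Equation   Computation

Time initialization
           $E_m^f(-1) = E_m^b(-1) = \delta > 0$
           $\beta_m(-1) = 0$,  $e_m^b(-1) = 0$,   $0 \le m < M-1$
           $\beta_m^c(-1) = 0$,   $0 \le m \le M-1$

Order initialization
(a)        $e_0^f(n) = e_0^b(n) = x(n)$,  $e_0(n) = y(n)$,  $\alpha_0(n-1) = 1$

Lattice part: m = 0, 1, ..., M - 2
(b)        $e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\,e_m^b(n-1)$
(c)        $e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\,e_m^f(n)$
(d)        $\beta_m(n) = \lambda\beta_m(n-1) + \alpha_m(n-1)\,e_m^b(n-1)\,e_m^{f*}(n)$
(e)        $E_m^f(n) = \lambda E_m^f(n-1) + \alpha_m(n-1)\,|e_m^f(n)|^2$
(f)        $E_m^b(n) = \lambda E_m^b(n-1) + \alpha_m(n)\,|e_m^b(n)|^2$
(g)        $k_m^f(n) = -\dfrac{\beta_m(n)}{E_m^b(n-1)}$
(h)        $k_m^b(n) = -\dfrac{\beta_m^*(n)}{E_m^f(n)}$
(i)        $\alpha_{m+1}(n) = \alpha_m(n) - \dfrac{\alpha_m^2(n)\,|e_m^b(n)|^2}{E_m^b(n)}$

Ladder part: m = 1, 2, ..., M
(j)        $\beta_m^c(n) = \lambda\beta_m^c(n-1) + \alpha_m(n)\,e_m^b(n)\,e_m^*(n)$
(k)        $k_m^c(n) = \dfrac{\beta_m^c(n)}{E_m^b(n)}$
(l)        $e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\,e_m^b(n)$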





9.6.3 RLS Lattice-Ladder Filters Using Error Feedback Updatings 


The LS lattice-ladder algorithms introduced in the previous section update the partial correlations $\beta_m(n)$ and
$\beta_m^c(n)$ and the minimum error energies $E_m^f(n)$ and $E_m^b(n)$, and then compute the coefficients of the LS
lattice-ladder filter by division. We next develop two algebraically equivalent algorithms, that is, algorithms that 
solve the same LS problem, which update the lattice-ladder coefficients directly. These algorithms, introduced in 
Ling et al. (1986), have good numerical properties when implemented with finite-word-length arithmetic. 
Starting with (9.6.38) and (9.6.45) we have 
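$$\begin{aligned}
k_m^c(n) &= \frac{\beta_m^c(n)}{E_m^b(n)} = \frac{\lambda\beta_m^c(n-1)}{E_m^b(n-1)}\,\frac{E_m^b(n-1)}{E_m^b(n)} + \frac{\alpha_m(n)\,e_m^b(n)\,e_m^*(n)}{E_m^b(n)} \\
&= \frac{1}{E_m^b(n)}\,[k_m^c(n-1)\,\lambda E_m^b(n-1) + \alpha_m(n)\,e_m^b(n)\,e_m^*(n)]
\end{aligned} \tag{9.6.50}$$

or using $\lambda E_m^b(n-1) = E_m^b(n) - \alpha_m(n)\,e_m^b(n)\,e_m^{b*}(n)$ we obtain

$$k_m^c(n) = k_m^c(n-1) + \frac{\alpha_m(n)\,e_m^b(n)}{E_m^b(n)}\,[e_m^*(n) - k_m^c(n-1)\,e_m^{b*}(n)]$$

or

$$k_m^c(n) = k_m^c(n-1) + \frac{\alpha_m(n)\,e_m^b(n)\,e_{m+1}^*(n)}{E_m^b(n)} \tag{9.6.51}$$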


using (9.6.49). Equation (9.6.51) provides a direct updating of the ladder parameters. Similar direct updating formulas 
can be obtained for the lattice coefficients (see Problem 9.39). Using these updatings, we obtain the a priori RLS 
lattice-ladder algorithm with error feedback shown in Table 9.14. 


TABLE 9.14
Computational organization of a priori RLS lattice-ladder algorithm with direct updating of its coefficients using error feedback formula.

Equation   Computation

Time initialization
           $E_m^f(-1) = E_m^b(-1) = \delta > 0$
           $k_m^f(-1) = k_m^b(-1) = 0$
           $e_m^b(-1) = 0$,  $k_m^c(-1) = 0$

Order initialization
(a)        $e_0^f(n) = e_0^b(n) = x(n)$,  $e_0(n) = y(n)$,  $\alpha_0(n) = 1$

Lattice part: m = 0, 1, ..., M - 2
(b)        $e_{m+1}^f(n) = e_m^f(n) + k_m^{f*}(n-1)\,e_m^b(n-1)$
(c)        $e_{m+1}^b(n) = e_m^b(n-1) + k_m^{b*}(n-1)\,e_m^f(n)$
(d)        $E_m^f(n) = \lambda E_m^f(n-1) + \alpha_m(n-1)\,|e_m^f(n)|^2$
(e)        $E_m^b(n) = \lambda E_m^b(n-1) + \alpha_m(n)\,|e_m^b(n)|^2$
(f)        $k_m^f(n) = k_m^f(n-1) - \dfrac{\alpha_m(n-1)\,e_m^b(n-1)\,e_{m+1}^{f*}(n)}{E_m^b(n-1)}$
(g)        $k_m^b(n) = k_m^b(n-1) - \dfrac{\alpha_m(n-1)\,e_m^f(n)\,e_{m+1}^{b*}(n)}{E_m^f(n)}$
(h)        $\alpha_{m+1}(n) = \alpha_m(n) - \dfrac{\alpha_m^2(n)\,|e_m^b(n)|^2}{E_m^b(n)}$

Ladder part: m = 0, 1, ..., M - 1
(i)        $e_{m+1}(n) = e_m(n) - k_m^{c*}(n-1)\,e_m^b(n)$
(j)        $k_m^c(n) = k_m^c(n-1) + \dfrac{\alpha_m(n)\,e_m^b(n)\,e_{m+1}^*(n)}{E_m^b(n)}$


We note that we first use the coefficient $k_m^c(n-1)$ to compute the higher-order error $e_{m+1}(n)$ by (9.6.49) and
then use that error to update the coefficient using (9.6.51). This updating has a feedback-like structure that is
sometimes referred to as error feedback form. An a posteriori form of the RLS lattice-ladder algorithm with error 
feedback can be easily obtained as shown in Problem 9.40. Simulation studies (Ling et al. 1986) have shown that 
when we use finite-precision arithmetic, the algorithms with direct updating of the lattice coefficients have better 
numerical properties than the algorithms with indirect updating. 
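
The recursions of Table 9.14 can be tried out with a short Python sketch. It is only an illustration under stated assumptions: the data are real-valued (so conjugates drop out), the function name `rls_lattice_ladder_ef` and the default values of `lam` and `delta` are hypothetical, and the backward energy of the last ladder stage is updated explicitly because the ladder needs it even though the lattice updates are listed only up to m = M - 2.

```python
def rls_lattice_ladder_ef(x, y, M, lam=0.99, delta=0.01):
    """A priori RLS lattice-ladder with error feedback, real-valued data.

    x, y : sequences of input and desired-response samples
    M    : number of ladder coefficients
    Returns the list of a priori errors e_M(n).
    """
    Ef = [delta] * M        # E^f_m(n-1)
    Eb = [delta] * M        # E^b_m(n-1)
    kf = [0.0] * M          # k^f_m(n-1)
    kb = [0.0] * M          # k^b_m(n-1)
    kc = [0.0] * M          # k^c_m(n-1)
    eb_old = [0.0] * M      # e^b_m(n-1)
    alpha_old = [1.0] * M   # alpha_m(n-1)
    errors = []
    for xn, yn in zip(x, y):
        ef = [0.0] * M
        eb = [0.0] * M
        alpha = [1.0] * M   # alpha_0(n) = 1, higher orders filled in below
        Eb_new = [0.0] * M
        ef[0] = eb[0] = xn  # order initialization (a)
        e = yn
        for m in range(M):
            # (e) backward energy E^b_m(n)
            Eb_new[m] = lam * Eb[m] + alpha[m] * eb[m] ** 2
            # ladder (i): higher-order error with the old coefficient
            e_next = e - kc[m] * eb[m]
            # ladder (j): error-feedback update of the ladder coefficient
            kc[m] += alpha[m] * eb[m] * e_next / Eb_new[m]
            e = e_next
            if m < M - 1:
                # lattice (b), (c): order updates with the old coefficients
                ef[m + 1] = ef[m] + kf[m] * eb_old[m]
                eb[m + 1] = eb_old[m] + kb[m] * ef[m]
                # (d) forward energy E^f_m(n)
                Ef[m] = lam * Ef[m] + alpha_old[m] * ef[m] ** 2
                # (f), (g): error-feedback updates of the lattice coefficients
                kf[m] -= alpha_old[m] * eb_old[m] * ef[m + 1] / Eb[m]
                kb[m] -= alpha_old[m] * ef[m] * eb[m + 1] / Ef[m]
                # (h) conversion factor of the next order
                alpha[m + 1] = alpha[m] - (alpha[m] * eb[m]) ** 2 / Eb_new[m]
        eb_old, alpha_old, Eb = eb, alpha, Eb_new
        errors.append(e)
    return errors
```

Each coefficient is corrected by the error it has just helped to produce, which is the feedback structure described above; no partial correlation is stored or divided.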
9.7 Tracking Performance of Adaptive Algorithms


Tracking of a time-varying system is an important problem in many areas of application. Consider, for example, a 
digital communications system in which the channel characteristics may change with time for various reasons. If we 
want to incorporate an echo canceler in such a system, then clearly the echo canceler must monitor the changing 
impulse response of the echo path so that it can generate an accurate replica of the echo. This will require the adaptive 
algorithm of an echo canceler to possess an acceptable tracking capability. Similar situations arise in adaptive 
equalization, adaptive prediction, adaptive noise canceling, and so on. In all these applications, adaptive filters are 
forced to operate in a nonstationary SOE. In this section, we examine the ability and performance of the LMS and 
RLS algorithms to track the ever-changing minimum point of the error surface. 

As discussed earlier, the tracking mode is a steady-state operation of the adaptive algorithm, and it follows the 
acquisition mode, which is a transient phenomenon. Therefore, the algorithm must acquire the system parameters 
before tracking can commence. This has two implications. First, the rate of convergence is generally not related to the 
tracking behavior, and as such, we analyze the tracking behavior when the number of iterations (or steps) is relatively 
large. Second, the time variation of the parameter change should be small enough compared to the rate of 
convergence that the algorithm can perform adequate tracking; otherwise, it is constantly acquiring the parameters. 


9.7.1 Approaches for Nonstationary SOE 


To effectively track a nonstationary SOE, adaptive algorithms should use only local statistics. There are three 
practical ways in which this can be achieved. 


Exponentially growing window 
In this approach, the current data are artificially emphasized by exponentially weighting past data values, as 
shown in Figure 9.38(a). The error function that is minimized is given by 





$$E(n) = \sum_{j=0}^{n} \lambda^{n-j}\,|y(j) - \mathbf{c}^H\mathbf{x}(j)|^2 = \lambda E(n-1) + |y(n) - \mathbf{c}^H\mathbf{x}(n)|^2 \tag{9.7.1}$$

FIGURE 9.38
Illustration of exponentially growing and fixed-length sliding windows: (a) exponentially growing window; (b) fixed-length sliding window.
where $0 < \lambda < 1$. Clearly, this is the cost function we used in the development of the RLS algorithm, given in Table
9.6, in which $\lambda$ is termed the forgetting factor. The effective window length is given by


$$L_{\mathrm{eff}} \triangleq \sum_{n=0}^{\infty} \lambda^n = \frac{1}{1-\lambda} \tag{9.7.2}$$

Hence for good tracking performance $\lambda$ should be in the range $0.9 \le \lambda < 1$. Note that $\lambda = 1$ results in a
rectangularly growing window that uses global statistics and hence will not be able to track parameter changes. Thus
the RLS algorithm with exponential forgetting is capable of using the local information needed to adapt in a
nonstationary SOE.
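
As a small numerical illustration of (9.7.1) and (9.7.2), the following Python sketch evaluates the effective window length and accumulates the exponentially weighted error energy recursively. The function names are hypothetical and the error samples are assumed to be supplied by the caller.

```python
def effective_window_length(lam):
    """Effective memory 1/(1 - lam) of the exponential window, 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("forgetting factor must satisfy 0 < lam < 1")
    return 1.0 / (1.0 - lam)


def weighted_error_energy(errors, lam):
    """Recursive form E(n) = lam * E(n-1) + |e(n)|^2 of the weighted sum."""
    energy = 0.0
    for e in errors:
        energy = lam * energy + abs(e) ** 2
    return energy
```

For instance, a forgetting factor of 0.99 corresponds to an effective memory of about 100 samples, while 0.9 shortens it to about 10.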


Fixed-length sliding window 

The basic feature of this approach is that the parameter estimates are based only on a finite number of past data 
values, as shown in Figure 9.38(b). Let us consider a rectangular window of fixed length L >M . Then the cost 
function that is minimized is given by 


$$E(n, L) \triangleq \sum_{j=n-L+1}^{n} |y(j) - \mathbf{c}^H\mathbf{x}(j)|^2 \tag{9.7.3}$$

When a new data value at n+1 is added to the sum in (9.7.3), the oldest data value, at n-L+1, is discarded. Thus the
active number of data values is always a constant equal to L, which makes this a constant-memory adaptive algorithm.
By following the steps given for the RLS adaptive filter in Section 9.5, it is possible to derive a recursive algorithm
to determine the filter $\mathbf{c}(n)$ that minimizes the error function in (9.7.3).

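Let $\mathbf{c}_{\{n-L\}}(n-1)$ denote the estimate of $\mathbf{c}(n-1)$ based on L data values between n-L and n-1. After
the new data value at n is observed, the RLS algorithm in Table 9.6 is applicable with $\lambda = 1$ and with obvious
extension of notation. Hence we obtain the algorithm


$$\mathbf{c}_{\{n-L\}}(n) = \mathbf{c}_{\{n-L\}}(n-1) + \mathbf{g}_{\{n-L\}}(n)\,e^*(n) \tag{9.7.4}$$
$$e(n) = y(n) - \mathbf{c}_{\{n-L\}}^H(n-1)\,\mathbf{x}(n) \tag{9.7.5}$$
$$\mathbf{g}_{\{n-L\}}(n) = \frac{\bar{\mathbf{g}}_{\{n-L\}}(n)}{\alpha_{\{n-L\}}(n)} \tag{9.7.6}$$
$$\bar{\mathbf{g}}_{\{n-L\}}(n) = \mathbf{P}_{\{n-L\}}(n-1)\,\mathbf{x}(n) \tag{9.7.7}$$
$$\alpha_{\{n-L\}}(n) = 1 + \bar{\mathbf{g}}_{\{n-L\}}^H(n)\,\mathbf{x}(n) \tag{9.7.8}$$
$$\mathbf{P}_{\{n-L\}}(n) = \mathbf{P}_{\{n-L\}}(n-1) - \mathbf{g}_{\{n-L\}}(n)\,\bar{\mathbf{g}}_{\{n-L\}}^H(n) \tag{9.7.9}$$


The above algorithm is based on L+1 data values. To maintain the data window at fixed length L, we have to
discard the observation at n-L. By using the matrix inversion lemma it can be shown that (see Problem 9.51)


$$\mathbf{c}_{\{n-L+1\}}(n) = \mathbf{c}_{\{n-L\}}(n) - \mathbf{g}_{\{n-L+1\}}(n)\,e^*(n-L) \tag{9.7.10}$$
$$e(n-L) = y(n-L) - \mathbf{c}_{\{n-L\}}^H(n)\,\mathbf{x}(n-L) \tag{9.7.11}$$
$$\mathbf{g}_{\{n-L+1\}}(n) = \frac{\bar{\mathbf{g}}_{\{n-L+1\}}(n)}{\alpha_{\{n-L+1\}}(n)} \tag{9.7.12}$$
$$\bar{\mathbf{g}}_{\{n-L+1\}}(n) = \mathbf{P}_{\{n-L\}}(n)\,\mathbf{x}(n-L) \tag{9.7.13}$$
$$\alpha_{\{n-L+1\}}(n) = 1 - \bar{\mathbf{g}}_{\{n-L+1\}}^H(n)\,\mathbf{x}(n-L) \tag{9.7.14}$$
$$\mathbf{P}_{\{n-L+1\}}(n) = \mathbf{P}_{\{n-L\}}(n) + \mathbf{g}_{\{n-L+1\}}(n)\,\bar{\mathbf{g}}_{\{n-L+1\}}^H(n) \tag{9.7.15}$$


The overall algorithm for the fixed-memory rectangular window adaptive algorithm is given by (9.7.4) through
(9.7.15), which recursively update $\mathbf{c}_{\{n-L\}}(n-1)$ to $\mathbf{c}_{\{n-L+1\}}(n)$. Thus, this algorithm can adapt to the nonstationary
SOE using the local information. The fixed-length sliding-window RLS algorithm can be implemented by using a
combination of two prewindowed RLS algorithms (Manolakis et al. 1987).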
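
The update and downdate steps (9.7.4) through (9.7.15) can be sketched in plain Python as follows. This is a minimal illustration, not the two-filter implementation cited above: it assumes real-valued data, takes regressor vectors as lists, initializes the inverse correlation matrix as P = I/delta (an assumption borrowed from the usual RLS initialization), and uses the hypothetical name `sliding_window_rls`.

```python
from collections import deque


def sliding_window_rls(xs, ys, L, delta=1e-3):
    """Sliding-window RLS for real data.

    xs : list of regressor vectors (lists of length M)
    ys : list of desired-response samples
    L  : window length
    Returns the coefficient estimate after every time step.
    """
    M = len(xs[0])
    c = [0.0] * M
    P = [[(1.0 / delta if i == j else 0.0) for j in range(M)] for i in range(M)]
    window = deque()

    def rank_one_step(x, y, sign):
        # sign = +1 adds the pair (x, y); sign = -1 removes it again
        gbar = [sum(P[i][j] * x[j] for j in range(M)) for i in range(M)]
        alpha = 1.0 + sign * sum(gbar[i] * x[i] for i in range(M))
        g = [gi / alpha for gi in gbar]
        e = y - sum(c[i] * x[i] for i in range(M))
        for i in range(M):
            c[i] += sign * g[i] * e
            for j in range(M):
                P[i][j] -= sign * g[i] * gbar[j]

    history = []
    for x, y in zip(xs, ys):
        rank_one_step(x, y, +1)              # include the new observation
        window.append((x, y))
        if len(window) > L:                  # keep exactly L observations
            x_old, y_old = window.popleft()
            rank_one_step(x_old, y_old, -1)  # discard the oldest observation
        history.append(list(c))
    return history
```

Because the window forgets old data completely, an abrupt change in the parameters stops influencing the estimate once L new samples have arrived.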


Evolutionary model—Kalman filter 

In the first two approaches, adaptation in the nonstationary SOE was obtained through the local information,
either by discarding old data or by deemphasizing it. In the third approach, we assume that we have a statistical model
that describes the nonstationarity of the SOE. This model is in the form of a stochastic difference equation together
with appropriate statistical properties. This leads to the well-known Kalman filter formulation in which we assume
that the parameter variations are modeled by


$$\mathbf{c}(n) = \boldsymbol{\Xi}(n)\,\mathbf{c}(n-1) + \boldsymbol{\psi}(n) \tag{9.7.16}$$


where $\boldsymbol{\psi}(n)$ is a random vector with zero mean and correlation matrix $\boldsymbol{\Sigma}(n)$, and $\boldsymbol{\Xi}(n)$ is the state-transition
matrix known for all n. The desired signal y(n) is modeled as


$$y(n) = \mathbf{c}^H(n)\,\mathbf{x}(n) + \varepsilon(n) \tag{9.7.17}$$


where $\varepsilon(n)$ is the a posteriori estimation error assumed to be zero-mean with variance $\sigma_\varepsilon^2$. Thus in this
formulation, the parameter vector $\mathbf{c}(n)$ acts as the state of a system while the input data vector $\mathbf{x}(n)$ acts as the
time-varying output vector. Now the best linear unbiased estimate $\hat{\mathbf{c}}(n)$ of $\mathbf{c}(n)$ based on past observations
$\{y(i)\}_0^n$ can be obtained by using the Kalman filter equations (Section 6.8). These recursive equations are given by


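$$\hat{\mathbf{c}}(n) = \boldsymbol{\Xi}(n)\,\hat{\mathbf{c}}(n-1) + \mathbf{g}(n)\,[y(n) - \hat{\mathbf{c}}^H(n-1)\,\boldsymbol{\Xi}^H(n)\,\mathbf{x}(n)] \tag{9.7.18}$$

$$\mathbf{g}(n) = \frac{\boldsymbol{\Xi}(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)}{\sigma_\varepsilon^2 + \mathbf{x}^H(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)} \tag{9.7.19}$$

$$\mathbf{P}(n) = \boldsymbol{\Xi}(n)\,\mathbf{P}(n-1)\,\boldsymbol{\Xi}^H(n) + \boldsymbol{\Sigma}(n) - \boldsymbol{\Xi}(n)\,\mathbf{P}(n-1)\,\frac{\mathbf{x}(n)\,\mathbf{x}^H(n)}{\sigma_\varepsilon^2 + \mathbf{x}^H(n)\,\mathbf{P}(n-1)\,\mathbf{x}(n)}\,\mathbf{P}(n-1)\,\boldsymbol{\Xi}^H(n) \tag{9.7.20}$$


where $\mathbf{g}(n)$ is the Kalman gain matrix and $\mathbf{P}(n)$ is the error covariance matrix. This approach implies that if the
time-varying parameters are modeled as state equations, then the Kalman filter rather than the adaptive filter is a
proper solution.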

Furthermore, it can be shown that the Kalman filter has a close similarity to the RLS adaptive filters if we make
the following appropriate substitutions:


Exponential memory: If we substitute

$$\boldsymbol{\Xi}(n) = \mathbf{I} \qquad \sigma_\varepsilon^2 = \lambda \qquad \boldsymbol{\Sigma}(n) = (\lambda^{-1} - 1)\,[\mathbf{I} - \mathbf{g}(n)\,\mathbf{x}^H(n)]\,\mathbf{P}(n-1) \tag{9.7.21}$$

then we obtain the exponential memory RLS algorithm given in Table 9.9.

Rectangularly growing memory: If we substitute

$$\boldsymbol{\Xi}(n) = \mathbf{I} \qquad \sigma_\varepsilon^2 = 1 \qquad \boldsymbol{\Sigma}(n) = \mathbf{0} \tag{9.7.22}$$

then we obtain the rectangularly growing memory RLS algorithm.
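
The exponential-memory substitution (9.7.21) can be checked numerically. The sketch below is restricted, as an assumption made for brevity, to a single real coefficient (M = 1), so that all matrices reduce to scalars; the function names are hypothetical. It writes out one RLS step and one Kalman step and lets the caller compare them.

```python
def rls_step(c, P, x, y, lam):
    """One exponentially weighted RLS update for a single real coefficient."""
    g = P * x / (lam + x * P * x)     # RLS gain
    c_new = c + g * (y - c * x)       # correction by the a priori error
    P_new = (P - g * x * P) / lam     # inverse-correlation update
    return c_new, P_new


def kalman_step(c, P, x, y, xi, var_eps, sigma):
    """One Kalman update, (9.7.18) to (9.7.20), written out for scalars."""
    denom = var_eps + x * P * x
    g = xi * P * x / denom
    c_new = xi * c + g * (y - c * xi * x)
    P_new = xi * P * xi + sigma - xi * P * x * x * P * xi / denom
    return c_new, P_new


def kalman_step_as_rls(c, P, x, y, lam):
    """Kalman update with the substitutions of (9.7.21)."""
    g = P * x / (lam + x * P * x)
    sigma = (1.0 / lam - 1.0) * (1.0 - g * x) * P
    return kalman_step(c, P, x, y, xi=1.0, var_eps=lam, sigma=sigma)
```

Running both recursions on the same data from the same initial values of c and P gives the same coefficient and the same P at every step, up to rounding.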


9.7.2 Preliminaries in Performance Analysis 


In Sections 9.4 and 9.5.4, we developed and analyzed the LMS and RLS algorithms in stationary environments, 
respectively. However, these algorithms are generally used in applications (e.g., modems) that are intended to operate 
continuously in SOE whose characteristics change with time. Therefore, we need to discuss the performance of these 
two widely used algorithms in such situations. Although we provided various adaptive filtering approaches for 
time-varying environments above, we now discuss, in the remainder of this section, the ability of these two 
algorithms to track time-varying parameters. We provide both analytical results, assuming a model of parameter 
variation, and experimental results, using simulations. 

A popular approach for this analytical assessment is to assume a first-order AR model with finite variance [that
is, we set $\boldsymbol{\Xi}(n) = \rho\mathbf{I}$ in (9.7.16)]. Although higher-order models are also possible, only a few results on the tracking
performance using these models are currently available. It is ironic that most analytical results on the tracking 
performance have been obtained for the random-walk model (a special case of the first-order AR model), which is 
unrealistic because of the infinite variance. A tutorial review of the latest results for the general case and additional 
references are available in Macchi (1996). 

In our analysis of tracking characteristics of the LMS and RLS algorithms, we use the first-order AR model and 
discuss its effect on the tracking performance. The closed-form results will be given using the random-walk model 
and confirmed using simulated experiments. 


Analysis setup 
In the tracking analysis, it is desirable to use the a priori adaptive filter. Hence we assume that the desired
response is generated by the following filter model


$$y(n) = \mathbf{c}_o^H(n-1)\,\mathbf{x}(n) + v(n) \tag{9.7.23}$$


where $v(n)$ is assumed to be WGN$(0, \sigma_v^2)$ with $\sigma_v^2 < \infty$. The random processes $\mathbf{x}(n)$ and $v(n)$ are
assumed to be independent and stationary. The variation of $\mathbf{c}_o(n)$ is modeled by the first-order AR (or Markov)
process

$$\mathbf{c}_o(n) = \rho\,\mathbf{c}_o(n-1) + \boldsymbol{\psi}(n) \tag{9.7.24}$$

with $0 < \rho < 1$ and creates the nonstationarity of the SOE. The quantity $\boldsymbol{\psi}(n)$ is the uncertainty in the model and is
assumed to be independent of $\mathbf{x}(n)$ and $v(n)$, with mean $E\{\boldsymbol{\psi}(n)\} = \mathbf{0}$ and correlation $E\{\boldsymbol{\psi}(n)\boldsymbol{\psi}^H(n)\} = \mathbf{R}_\psi$.
Tracking is generally achievable if $\rho$ is close to 1. The random-walk model is obtained by using $\rho = 1$ in
(9.7.24).
Conjugate transposing and premultiplying both sides of (9.7.23) by $\mathbf{x}(n)$, taking the expectation, and using
independence between $\mathbf{x}(n)$ and $v(n)$, we obtain

$$\mathbf{R}\,\mathbf{c}_o(n-1) = \mathbf{d}(n) \tag{9.7.25}$$

Hence, $\mathbf{c}_o(n-1)$ is the optimum a priori filter and


$$e_o(n) = y(n) - \mathbf{c}_o^H(n-1)\,\mathbf{x}(n) = v(n) \tag{9.7.26}$$


is the optimum a priori error. If $\mathbf{R}_\psi = \mathbf{0}$ and $\rho = 1$, we have $\mathbf{c}_o(n) = \mathbf{c}_o$ for all n, and therefore $y(n)$ is
wide-sense stationary (WSS). In this case, we have a stationary environment, and the goal of the adaptive filter is to
find the optimum filter $\mathbf{c}_o$. For $\mathbf{R}_\psi \ne \mathbf{0}$, the adaptive filter should find and track the optimum a priori filter $\mathbf{c}_o(n)$.
This setup, which is widely used to analyze the properties of adaptive algorithms, is illustrated in Figure 9.39.
We use this model to make a fair comparison between the adaptive and the optimum filter.
FIGURE 9.39
Block diagram of the setup and model used for the analysis of adaptive algorithms. 


Assumptions 

To analyze the tracking performance of adaptive algorithms, we use the assumptions discussed elsewhere and 
repeated below for convenience. 

A1 The sequence of input data vectors $\mathbf{x}(n)$ is WGN$(\mathbf{0}, \mathbf{R})$.

A2 The desired response $y(n)$ can be modeled as
$$y(n) = \mathbf{c}_o^H(n-1)\,\mathbf{x}(n) + e_o(n) \tag{9.7.27}$$
where $e_o(n)$ is WGN$(0, \sigma_v^2)$.

A3 The time variation of $\mathbf{c}_o(n)$ is described by
$$\mathbf{c}_o(n) = \rho\,\mathbf{c}_o(n-1) + \boldsymbol{\psi}(n) \tag{9.7.28}$$
where $0 \le \rho \le 1$ and $\boldsymbol{\psi}(n)$ is WGN$(\mathbf{0}, \mathbf{R}_\psi)$.

A4 The random sequences $\mathbf{x}(n)$, $e_o(n)$, and $\boldsymbol{\psi}(n)$ are mutually independent.

Through these assumptions, we want to stress that the nonstationarity of the SOE is created solely by $\mathbf{c}_o(n)$
and not by $\mathbf{x}(n)$, which is WSS.

Although we provide analysis for (9.7.27), many results are given for the random-walk model ($\rho = 1$). The case
$0 < \rho < 1$, which is straightforward but complicated, is discussed in Solo and Kong (1995). Before we delve into
this analysis, we discuss criteria that are used for evaluating the tracking performance.


Degree of nonstationarity 

To determine whether an adaptive algorithm can adequately track the changing SOE, one needs to define the
speed of variation of the statistics of the adaptive filter environment. This speed is quantified in terms of the degree of
nonstationarity (DNS), introduced in Macchi (1995, 1996), and is defined by


$$\eta(n) \triangleq \sqrt{\frac{E\{|y_{o,\mathrm{incr}}(n)|^2\}}{P_o(n)}} \tag{9.7.29}$$

where

$$y_{o,\mathrm{incr}}(n) = [\mathbf{c}_o(n) - \mathbf{c}_o(n-1)]^H\,\mathbf{x}(n) \tag{9.7.30}$$

is the output of the incremental filter. The numerator is the power introduced by the variation of the optimum filter,
and the denominator is the MMSE, which in the context of (9.7.26) is equal to the power of the output noise.
Assuming $\rho = 1$ in (9.7.28), we see that (9.7.30) is given by

$$y_{o,\mathrm{incr}}(n) = \boldsymbol{\psi}^H(n)\,\mathbf{x}(n)$$

and hence the numerator in (9.7.29) is given by

$$\begin{aligned}
E\{|y_{o,\mathrm{incr}}(n)|^2\} &= E\{\boldsymbol{\psi}^H(n)\mathbf{x}(n)\mathbf{x}^H(n)\boldsymbol{\psi}(n)\} = \mathrm{tr}[E\{\boldsymbol{\psi}^H(n)\mathbf{x}(n)\mathbf{x}^H(n)\boldsymbol{\psi}(n)\}] \\
&= \mathrm{tr}[E\{\boldsymbol{\psi}(n)\boldsymbol{\psi}^H(n)\mathbf{x}(n)\mathbf{x}^H(n)\}] = \mathrm{tr}[E\{\boldsymbol{\psi}(n)\boldsymbol{\psi}^H(n)\}\,E\{\mathbf{x}(n)\mathbf{x}^H(n)\}] \\
&= \mathrm{tr}[\mathbf{R}_\psi\mathbf{R}] = \mathrm{tr}[\mathbf{R}\mathbf{R}_\psi]
\end{aligned} \tag{9.7.31}$$

where we have used the independence assumption A4. Substituting (9.7.31) in (9.7.29), we obtain

$$\eta(n) = \sqrt{\frac{\mathrm{tr}[\mathbf{R}\mathbf{R}_\psi]}{P_o(n)}} \tag{9.7.32}$$

Smaller values of $\eta$ ($\ll 1$) imply that the adaptive algorithm can track time variations of the nonstationary SOE. On
the contrary, if $\eta > 1$, then the statistical variations of the SOE are too fast for the adaptive algorithm to keep up with
the SOE and lead to massive misadjustment errors. In such situations, an adaptive filter should not be used.
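
The degree of nonstationarity in (9.7.32) is easy to evaluate numerically. The following Python sketch does so for matrices given as nested lists; the function name and the numbers in the usage note are illustrative assumptions.

```python
import math


def degree_of_nonstationarity(R, R_psi, P_o):
    """Evaluate sqrt(tr(R R_psi) / P_o) for square matrices given as nested lists."""
    M = len(R)
    trace = sum(R[i][k] * R_psi[k][i] for i in range(M) for k in range(M))
    return math.sqrt(trace / P_o)
```

With R equal to the 2 x 2 identity, R_psi equal to 0.0001 times the identity, and an MMSE of 0.01, the function returns about 0.1414, which is well below 1 and therefore indicates a trackable environment.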


Mean square deviation (MSD) 

We defined the MSD $\mathcal{D}(n)$ in (9.2.29) as a performance measure for adaptive filters in the steady-state
environment. It is also used for measuring the tracking performance. Consider the coefficient error vector $\tilde{\mathbf{c}}(n)$,
which can be written as


$$\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o(n) = [\mathbf{c}(n) - E\{\mathbf{c}(n)\}] + [E\{\mathbf{c}(n)\} - \mathbf{c}_o(n)] \tag{9.7.33}$$
$$\tilde{\mathbf{c}}(n) \triangleq \tilde{\mathbf{c}}_1(n) + \tilde{\mathbf{c}}_2(n) \tag{9.7.34}$$

where $\tilde{\mathbf{c}}_1(n)$ is the fluctuation of the adaptive filter parameter vector about its mean (estimation error) and $\tilde{\mathbf{c}}_2(n)$
is the bias of $\mathbf{c}(n)$ with respect to the true vector $\mathbf{c}_o(n)$ (systematic or lag error). Using the independence
assumption of the previous section that $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$ are statistically independent, we can show that (Macchi
1996)


$$E\{\tilde{\mathbf{c}}_1^H(n)\,\tilde{\mathbf{c}}_2(n)\} = 0 \tag{9.7.35}$$


which by using (9.2.29) and (9.7.34) leads to

$$\mathcal{D}(n) = \mathcal{D}_1(n) + \mathcal{D}_2(n) \tag{9.7.36}$$

The first MSD term is due to the parameter estimation error and is called the estimation variance. The second MSD
term is due to the parameter lag error and is termed lag variance, and its presence indicates the nonstationary
environment.


Misadjustment and lowest excess MSE 
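The second performance measure, defined in (9.2.38), is the (a priori) misadjustment $\mathcal{M}(n)$, which is the ratio
of the excess MSE $P_{\mathrm{ex}}(n)$ to the MMSE $P_o(n)$. The a priori excess MSE is given by


$$P_{\mathrm{ex}}(n) = E\{|\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)|^2\} = E\{|\tilde{\mathbf{c}}_1^H(n-1)\mathbf{x}(n) + \tilde{\mathbf{c}}_2^H(n-1)\mathbf{x}(n)|^2\} \tag{9.7.37}$$

which under the independence assumption and (9.7.35) can be written as

$$P_{\mathrm{ex}}(n) = P_{\mathrm{ex},1}(n) + P_{\mathrm{ex},2}(n) \tag{9.7.38}$$


where the first term, $P_{\mathrm{ex},1}(n)$, is the excess MSE due to estimation error and is termed the estimation noise while the
second term, $P_{\mathrm{ex},2}(n)$, is the excess MSE due to lag error and is called the lag noise. Therefore, we can also write
the misadjustment $\mathcal{M}(n)$ as

$$\mathcal{M}(n) = \mathcal{M}_1(n) + \mathcal{M}_2(n) \tag{9.7.39}$$


where $\mathcal{M}_1(n)$ is the estimation misadjustment and $\mathcal{M}_2(n)$ is the lag misadjustment.

In the context of the first-order Markov model, the best performance obtained by any a priori adaptive filter
occurs if $\mathbf{c}(n) = \rho\,\mathbf{c}_o(n-1)$. This observation makes possible the computation of a lower bound for the excess MSE
of any a priori adaptive algorithm. From (9.7.34) and (9.7.24), we have


$$\tilde{\mathbf{c}}(n) = \mathbf{c}(n) - \mathbf{c}_o(n) = [\mathbf{c}(n) - \rho\,\mathbf{c}_o(n-1)] - \boldsymbol{\psi}(n) \triangleq \breve{\mathbf{c}}(n) - \boldsymbol{\psi}(n) \tag{9.7.40}$$

and hence

$$P_{\mathrm{ex}}(n) = E\{|\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)|^2\} = E\{|\breve{\mathbf{c}}^H(n-1)\mathbf{x}(n) - \boldsymbol{\psi}^H(n-1)\mathbf{x}(n)|^2\} \tag{9.7.41}$$
$$P_{\mathrm{ex}}(n) = E\{|\breve{\mathbf{c}}^H(n-1)\mathbf{x}(n)|^2\} + E\{|\boldsymbol{\psi}^H(n-1)\mathbf{x}(n)|^2\} - 2\,\mathrm{Re}\,E\{\breve{\mathbf{c}}^H(n-1)\mathbf{x}(n)\mathbf{x}^H(n)\boldsymbol{\psi}(n-1)\} \tag{9.7.42}$$


Since the term $\breve{\mathbf{c}}(n)$ does not depend on $\boldsymbol{\psi}(n)$ and since the random sequences $\mathbf{x}(n)$ and $\boldsymbol{\psi}(n-1)$ are
assumed independent, the last term in (9.7.42) is zero. Hence,


$$P_{\mathrm{ex}}(n) \ge E\{|\boldsymbol{\psi}^H(n-1)\mathbf{x}(n)|^2\} \tag{9.7.43}$$


which provides a lower bound for the excess MSE of any a priori adaptation algorithm. Because $\boldsymbol{\psi}(n)$ and $\mathbf{x}(n)$
are assumed independent, we obtain


$$E\{|\boldsymbol{\psi}^H(n-1)\mathbf{x}(n)|^2\} = \mathrm{tr}(\mathbf{R}\mathbf{R}_\psi) \tag{9.7.44}$$

Similarly, neglecting the dependence between $\mathbf{x}(n)$ and $\tilde{\mathbf{c}}(n-1)$, we have

$$E\{|\tilde{\mathbf{c}}^H(n-1)\mathbf{x}(n)|^2\} = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}(n-1)] \tag{9.7.45}$$


which provides the a priori excess MSE. Furthermore, it can be shown that the DNS places a lower limit on the
misadjustment, that is,


$$\mathcal{M}(n) = \frac{P_{\mathrm{ex}}(n)}{P_o(n)} \ge \frac{E\{|\boldsymbol{\psi}^H(n-1)\mathbf{x}(n)|^2\}}{P_o(n)} = \frac{\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi)}{\sigma_v^2} = \eta^2(n) \tag{9.7.46}$$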


9.7.3. LMS Algorithm 


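Using the LMS algorithm (9.4.12), the error vector in (9.7.34), and the Markov model in (9.7.28) with $\rho = 1$, we can
easily obtain


$$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\,\mathbf{x}(n)\mathbf{x}^H(n)]\,\tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)\,e_o^*(n) - \boldsymbol{\psi}(n) \tag{9.7.47}$$


which, compared to (9.4.15), has one extra input. Since $\mathbf{x}(n)$, $e_o(n)$, and $\boldsymbol{\psi}(n)$ are mutually independent, $\boldsymbol{\psi}(n)$
adds only an extra term $\sigma_\psi^2\mathbf{I}$ to the correlation of $\tilde{\mathbf{c}}(n)$.


Misadjustment. To determine the misadjustment, we perform orthogonal transformation of the correlation
matrix of $\tilde{\mathbf{c}}(n)$. When we transform (9.4.28) to (9.4.30), using the orthogonal transformation (9.4.29), the presence
of the diagonal matrix $\sigma_\psi^2\mathbf{I}$ changes only the diagonal components with the addition of the term $\sigma_\psi^2$. Indeed, we
can easily show that

$$\theta_k(n) = \rho_k\,\theta_k(n-1) + 4\mu^2\lambda_k P_{\mathrm{ex}}(n-1) + 4\mu^2 P_o\lambda_k + \sigma_\psi^2 \tag{9.7.48}$$


where $P_o(n) \approx P_o = \sigma_v^2$ for large n. Clearly, (9.7.48) converges under the same conditions as (9.4.40). At steady
state we have


$$\theta_k(\infty) = \rho_k\,\theta_k(\infty) + 4\mu^2\lambda_k P_{\mathrm{ex}}(\infty) + 4\mu^2 P_o\lambda_k + \sigma_\psi^2 \tag{9.7.49}$$


or using (9.4.36), we have


$$\theta_k(\infty) = \mu\,\frac{P_o + P_{\mathrm{ex}}(\infty)}{1 - 2\mu\lambda_k} + \frac{1}{4\mu\lambda_k}\,\frac{\sigma_\psi^2}{1 - 2\mu\lambda_k} \tag{9.7.50}$$

which in conjunction with (9.4.55) and (9.4.56) gives

$$P_{\mathrm{ex}}(\infty) = \frac{C(\mu)}{1 - C(\mu)}\,\sigma_v^2 + \frac{D(\mu)}{4\mu\,[1 - C(\mu)]}\,\sigma_\psi^2 \tag{9.7.51}$$

where

$$D(\mu) \triangleq \sum_{k=1}^{M} \frac{1}{1 - 2\mu\lambda_k} \tag{9.7.52}$$

If $\mu\lambda_k \ll 1$, we have $C(\mu) \approx \mu\,\mathrm{tr}(\mathbf{R})$ and $D(\mu) \approx M$, which lead to

$$P_{\mathrm{ex}}(\infty) \approx \mu\,\sigma_v^2\,\mathrm{tr}(\mathbf{R}) + \frac{M\sigma_\psi^2}{4\mu} \tag{9.7.53}$$

or

$$\mathcal{M}(\infty) \approx \mu\,\mathrm{tr}(\mathbf{R}) + \frac{M}{4\mu}\,\frac{\sigma_\psi^2}{\sigma_v^2} \tag{9.7.54}$$


Hence in the steady state, the misadjustment can be approximated by two terms. The first term is estimation
misadjustment, which increases with $\mu$, while the second term is the lag misadjustment, which decreases with $\mu$.
Therefore, an optimum value of $\mu$ exists that minimizes $\mathcal{M}(\infty)$, given by


$$\mu_{\mathrm{opt}} \approx \frac{\sigma_\psi}{2\sigma_v}\sqrt{\frac{M}{\mathrm{tr}(\mathbf{R})}} \tag{9.7.55}$$

with

$$\mathcal{M}_{\min}(\infty) \approx \frac{\sigma_\psi}{\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})} \tag{9.7.56}$$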


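MSD. To determine the MSD, consider (9.7.47). For small step size $\mu$, the system matrix $[\mathbf{I} - 2\mu\,\mathbf{x}(n)\mathbf{x}^H(n)]$
is very close to the identity matrix. Hence using the direct averaging method due to Kushner (1984), we can obtain a
close solution of $\tilde{\mathbf{c}}(n)$ by solving (9.7.47) in which the system matrix is replaced by its average $[\mathbf{I} - 2\mu\mathbf{R}]$, that is,


$$\tilde{\mathbf{c}}(n) = [\mathbf{I} - 2\mu\mathbf{R}]\,\tilde{\mathbf{c}}(n-1) + 2\mu\,\mathbf{x}(n)\,e_o^*(n) - \boldsymbol{\psi}(n) \tag{9.7.57}$$

where we have kept the same notation. Taking the covariance of both sides of (9.7.57), we obtain

$$\boldsymbol{\Phi}(n) = [\mathbf{I} - 2\mu\mathbf{R}]\,\boldsymbol{\Phi}(n-1)\,[\mathbf{I} - 2\mu\mathbf{R}] + 4\mu^2\sigma_v^2\mathbf{R} + \mathbf{R}_\psi \tag{9.7.58}$$

The approximate steady-state solution of (9.7.58) is given by

$$\mathbf{R}\boldsymbol{\Phi} + \boldsymbol{\Phi}\mathbf{R} = 2\mu\sigma_v^2\mathbf{R} + \frac{\mathbf{R}_\psi}{2\mu} \tag{9.7.59}$$

where the second-order term $4\mu^2\mathbf{R}\boldsymbol{\Phi}\mathbf{R}$ is ignored for small values of $\mu$. After premultiplying (9.7.59) by $\mathbf{R}^{-1}$,
we obtain

$$\boldsymbol{\Phi} + \mathbf{R}^{-1}\boldsymbol{\Phi}\mathbf{R} = 2\mu\sigma_v^2\mathbf{I} + \frac{\mathbf{R}^{-1}\mathbf{R}_\psi}{2\mu} \tag{9.7.60}$$

Taking the trace of (9.7.60) and using $\mathrm{tr}(\mathbf{R}^{-1}\boldsymbol{\Phi}\mathbf{R}) = \mathrm{tr}(\boldsymbol{\Phi})$, we obtain

$$\mathrm{tr}(\boldsymbol{\Phi}) = \mu M\sigma_v^2 + \frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{4\mu} \tag{9.7.61}$$


By following the development in (9.7.28), it can be shown that (Problem 9.52) $\mathcal{D}(\infty) = \mathrm{tr}(\boldsymbol{\Phi})$. Hence

$$\mathcal{D}(\infty) = \mu M\sigma_v^2 + \frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{4\mu} \tag{9.7.62}$$


As expected, the MSD has two terms: The estimation deviation is linearly proportional to $\mu$ while the lag deviation
is inversely proportional to $\mu$. The optimum value of the step size $\mu$ is obtained when both deviations are equal
and is given by


$$\mu_{\mathrm{opt}} \approx \frac{1}{2\sigma_v}\sqrt{\frac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{M}} \tag{9.7.63}$$

with

$$\mathcal{D}_{\min}(\infty) \approx \sqrt{M\sigma_v^2\,\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)} \tag{9.7.64}$$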
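
The small-step-size expressions (9.7.54), (9.7.55), (9.7.62), and (9.7.63) are simple enough to evaluate directly. The Python sketch below does this under the same assumptions as the text (random-walk model, $\mathbf{R}_\psi = \sigma_\psi^2\mathbf{I}$ for the misadjustment); the function names and the convention of passing the traces as plain numbers are illustrative choices.

```python
import math


def lms_misadjustment(mu, M, tr_R, sigma_psi, sigma_v):
    """Steady-state misadjustment: estimation term plus lag term."""
    return mu * tr_R + M * sigma_psi ** 2 / (4.0 * mu * sigma_v ** 2)


def lms_msd(mu, M, tr_Rinv_Rpsi, sigma_v):
    """Steady-state MSD: estimation deviation plus lag deviation."""
    return mu * M * sigma_v ** 2 + tr_Rinv_Rpsi / (4.0 * mu)


def lms_optimum_steps(M, tr_R, tr_Rinv_Rpsi, sigma_psi, sigma_v):
    """Step sizes that balance the two terms of each measure."""
    mu_mis = sigma_psi / (2.0 * sigma_v) * math.sqrt(M / tr_R)
    mu_msd = math.sqrt(tr_Rinv_Rpsi / M) / (2.0 * sigma_v)
    return mu_mis, mu_msd
```

For M = 2, tr(R) = 2, tr(R^{-1} R_psi) = 0.0002, sigma_psi = 0.01, and sigma_v = 0.1, both optimum step sizes come out as 0.05, with a minimum misadjustment of 0.2 and a minimum MSD of 0.002.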


EXAMPLE 9.7.1. To study the tracking performance of the LMS algorithm, we will simulate a slowly time-varying SOE whose
parameters follow an almost random-walk behavior. The simulation setup is shown in Figure 9.39 and given by (9.7.27) and
(9.7.28). The simulation parameters are as follows:


$\mathbf{c}_o(n)$ model parameters:   $\mathbf{c}_o(0) = [-0.8 \;\; 0.95]^T$   $M = 2$   $\rho = 0.999$
                              $\boldsymbol{\psi}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R}_\psi)$   $\mathbf{R}_\psi = (0.01)^2\,\mathbf{I}$
Signal $\mathbf{x}(n)$ parameters:     $\mathbf{x}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R})$   $\mathbf{R} = \mathbf{I}$
Noise $v(n)$ parameters:      $v(n) \sim$ WGN$(0, \sigma_v^2)$   $\sigma_v = 0.1$


For these values, the degree of nonstationarity from (9.7.32) is given by

$$\eta(n) = \frac{\sqrt{\mathrm{tr}(\mathbf{R}\mathbf{R}_\psi)}}{\sigma_v} = 0.1414 < 1$$

which means that the LMS can track the time variations of the SOE.
Three different adaptations (slow, matched, and fast) of the LMS algorithm were designed. Their adaptation results are shown
in Figures 9.40 through 9.45. From (9.7.55) and (9.7.63), the optimum performance is obtained when

$$\mu_{\mathrm{opt}} \approx 0.05$$

for which $\mathcal{M}_{\min}(\infty) = 0.2$ and $\mathcal{D}_{\min}(\infty) = 0.002$. Hence, the following values for $\mu$ were selected for simulation:

Slow:     $\mu = 0.01$
Matched:  $\mu = 0.1$
Fast:     $\mu = 0.3$

Figure 9.40 shows the matched adaptation of parameter coefficients while Figure 9.41 shows the resulting $\mathcal{D}(n)$ and $\mathcal{M}(n)$.
Clearly, the LMS tracks the varying coefficients nicely with expected small misadjustment and deviation errors. Figure 9.42 shows
the slow adaptation of parameter coefficients while Figure 9.43 shows the resulting $\mathcal{D}(n)$ and $\mathcal{M}(n)$. In this case, although the
LMS algorithm tracks with bounded error variance, the tracking is not very good and the resulting misadjustment errors are large.
Finally, Figure 9.44 shows the fast adaptation of parameter coefficients while Figure 9.45 shows the resulting $\mathcal{D}(n)$ and $\mathcal{M}(n)$.
In this case, although the algorithm is able to keep track of the slowly varying coefficients, the resulting variance is large and hence
the estimation errors are large. Once again, the total errors are large compared to those for the matched case.
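
An experiment of this kind can be sketched in a few lines of Python. The code below is not the simulation behind the figures: it assumes real-valued data, draws independent white Gaussian input vectors in line with assumption A1 rather than using a tapped delay line, and uses hypothetical names, initial values, and run lengths. It returns a time-averaged estimate of the steady-state MSD for a given step size.

```python
import random


def lms_tracking_msd(mu, n_samples=6000, rho=0.999, sigma_psi=0.01,
                     sigma_v=0.1, seed=0):
    """Time-averaged squared deviation of a two-coefficient LMS tracker.

    The optimum filter follows a first-order AR model with coefficient rho
    and is tracked by an LMS filter with update c <- c + 2*mu*x*e.
    The average is taken over the second half of the run.
    """
    rng = random.Random(seed)
    c_o = [-0.8, 0.95]   # illustrative initial optimum filter
    c = [0.0, 0.0]       # adaptive filter starts from zero
    sq_dev = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        # desired response generated by the optimum filter of the previous step
        y = c_o[0] * x[0] + c_o[1] * x[1] + sigma_v * rng.gauss(0.0, 1.0)
        e = y - (c[0] * x[0] + c[1] * x[1])          # a priori error
        c = [c[k] + 2.0 * mu * x[k] * e for k in range(2)]
        c_o = [rho * c_o[k] + sigma_psi * rng.gauss(0.0, 1.0) for k in range(2)]
        sq_dev.append(sum((c[k] - c_o[k]) ** 2 for k in range(2)))
    tail = sq_dev[n_samples // 2:]
    return sum(tail) / len(tail)
```

Comparing, say, `lms_tracking_msd(0.05)` with `lms_tracking_msd(0.005)` shows the effect of lag: the smaller step size reduces the estimation noise but lets the filter fall behind the drifting parameters, so its deviation is larger.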
FIGURE 9.40
Matched adaptation of slowly time-varying parameters: LMS algorithm with $\mu = 0.1$.

FIGURE 9.41
Learning curves of LMS algorithm with matched adaptation.

FIGURE 9.42
Slow adaptation of slowly time-varying parameters: LMS algorithm with $\mu = 0.01$.

FIGURE 9.43
Learning curves of LMS algorithm for slow adaptation.

FIGURE 9.44
Fast adaptation of slowly time-varying parameters: LMS algorithm with $\mu = 0.3$.

FIGURE 9.45
Learning curves of LMS algorithm for fast adaptation.


9.7.4 RLS Algorithm with Exponential Forgetting 


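Consider again the model given in Figure 9.39 and described in the analysis setup.


Misadjustment. To determine the misadjustment in tracking, we first evaluate the excess MSE caused by lag,
that is, by the deviation $\tilde{\mathbf{c}}_{\mathrm{lag}}(n) \triangleq E\{\mathbf{c}(n)\} - \mathbf{c}_o(n)$ between $E\{\mathbf{c}(n)\}$ and the optimum a priori filter $\mathbf{c}_o(n)$. Combining


$$\mathbf{c}(n) = \mathbf{c}(n-1) + \hat{\mathbf{R}}^{-1}(n)\,\mathbf{x}(n)\,e^*(n) \tag{9.7.65}$$

with

$$e^*(n) = e_o^*(n) - \mathbf{x}^H(n)\,[\mathbf{c}(n-1) - \mathbf{c}_o(n-1)] \tag{9.7.66}$$

and taking the expectation results in

$$E\{\mathbf{c}(n)\} = E\{\mathbf{c}(n-1)\} - E\{\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)\mathbf{x}^H(n)\}\,[E\{\mathbf{c}(n-1)\} - \mathbf{c}_o(n-1)] \tag{9.7.67}$$


because the expectation of $\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)e_o^*(n)$ vanishes. Using the approximation $E\{\hat{\mathbf{R}}^{-1}(n)\mathbf{x}(n)\mathbf{x}^H(n)\} \approx (1-\lambda)\mathbf{I}$,
we have


$$\tilde{\mathbf{c}}_{\mathrm{lag}}(n) \approx \lambda\,\tilde{\mathbf{c}}_{\mathrm{lag}}(n-1) + \mathbf{c}_o(n-1) - \mathbf{c}_o(n) \tag{9.7.68}$$

or

$$\tilde{\mathbf{c}}_{\mathrm{lag}}(n) \approx \lambda\,\tilde{\mathbf{c}}_{\mathrm{lag}}(n-1) - \boldsymbol{\psi}(n) \tag{9.7.69}$$

for the random-walk ($\rho = 1$) model. The covariance matrix is

$$\boldsymbol{\Phi}_{\mathrm{lag}}(n) = \lambda^2\,\boldsymbol{\Phi}_{\mathrm{lag}}(n-1) + \mathbf{R}_\psi \tag{9.7.70}$$

and in steady state (assuming $0 < \lambda < 1$)

$$\boldsymbol{\Phi}_{\mathrm{lag}}(\infty) = \frac{1}{1-\lambda^2}\,\mathbf{R}_\psi \tag{9.7.71}$$

The lag excess MSE is

$$P_{\mathrm{lag}}(\infty) = \mathrm{tr}[\mathbf{R}\boldsymbol{\Phi}_{\mathrm{lag}}(\infty)] = \frac{1}{1-\lambda^2}\,\mathrm{tr}[\mathbf{R}\mathbf{R}_\psi] \approx \frac{1}{2(1-\lambda)}\,\mathrm{tr}[\mathbf{R}\mathbf{R}_\psi] \tag{9.7.72}$$


because $1 - \lambda^2 = (1+\lambda)(1-\lambda) \approx 2(1-\lambda)$ for $\lambda \approx 1$.
The excess MSE due to estimation is $[(1-\lambda)/2]M\sigma_v^2$, hence the total excess MSE is


$$P_{\mathrm{ex}}(\infty) \approx \frac{1-\lambda}{2}\,M\sigma_v^2 + \frac{\sigma_\psi^2\,\mathrm{tr}(\mathbf{R})}{2(1-\lambda)} \tag{9.7.73}$$

if $\mathbf{R}_\psi = \sigma_\psi^2\mathbf{I}$. Finally, the misadjustment is given by

$$\mathcal{M}(\infty) \approx \frac{1-\lambda}{2}\,M + \frac{\sigma_\psi^2\,\mathrm{tr}(\mathbf{R})}{2(1-\lambda)\,\sigma_v^2} \tag{9.7.74}$$


The first term in (9.7.74) is the estimation misadjustment, which is linearly proportional to $1-\lambda$, while the second
term is the lag misadjustment, which is inversely proportional to $1-\lambda$. The optimum value of $\lambda$ is given by


$$\lambda_{\mathrm{opt}} \approx 1 - \frac{\sigma_\psi}{\sigma_v}\sqrt{\frac{\mathrm{tr}(\mathbf{R})}{M}} \tag{9.7.75}$$

and the minimum misadjustment is given by

$$\mathcal{M}_{\min}(\infty) \approx \frac{\sigma_\psi}{\sigma_v}\sqrt{M\,\mathrm{tr}(\mathbf{R})} \tag{9.7.76}$$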

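MSD. An analysis similar to the MSD development of the LMS algorithm can be done to obtain


$$\mathcal{D}(\infty) \approx \frac{1-\lambda}{2}\,\sigma_v^2\,\mathrm{tr}(\mathbf{R}^{-1}) + \frac{M\sigma_\psi^2}{2(1-\lambda)} \tag{9.7.77}$$

with

$$\lambda_{\mathrm{opt}} \approx 1 - \frac{\sigma_\psi}{\sigma_v}\sqrt{\frac{M}{\mathrm{tr}(\mathbf{R}^{-1})}} \tag{9.7.78}$$

and

$$\mathcal{D}_{\min}(\infty) \approx \sigma_\psi\sigma_v\sqrt{M\,\mathrm{tr}(\mathbf{R}^{-1})} \tag{9.7.79}$$


which again highlights the dependence of tracking abilities on $\lambda$.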





EXAMPLE 9.7.2. To study the tracking performance of the RLS algorithm, we again simulate the slowly time-varying SOE given
in Example 9.7.1 whose parameters are repeated here:


$\mathbf{c}_o(n)$ model parameters:   $\mathbf{c}_o(0) = [-0.8 \;\; 0.95]^T$   $M = 2$   $\rho = 0.999$
                              $\boldsymbol{\psi}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R}_\psi)$   $\mathbf{R}_\psi = (0.01)^2\,\mathbf{I}$
Signal $\mathbf{x}(n)$ parameters:     $\mathbf{x}(n) \sim$ WGN$(\mathbf{0}, \mathbf{R})$   $\mathbf{R} = \mathbf{I}$
Noise $v(n)$ parameters:      $v(n) \sim$ WGN$(0, \sigma_v^2)$   $\sigma_v = 0.1$


For these values, the degree of nonstationarity is $\eta(n) = 0.1414$, which means that the RLS can track the time variations of
the SOE.
Three different adaptations (slow, matched, and fast) of the RLS algorithm were designed. Their adaptation results are shown
in Figures 9.46 through 9.51. From (9.7.75) and (9.7.77), the optimum misadjustment performance is obtained when
$\lambda_{\mathrm{opt}} = 0.9$ with $\mathcal{M}_{\min}(\infty) = 0.2$
while from (9.7.78) and (9.7.79), the optimum deviation performance is obtained when
$\lambda_{\mathrm{opt}} = 0.93$ with $\mathcal{D}_{\min}(\infty) = 0.007$
Hence, the following values for $\lambda$ were selected for simulation:

Slow:     $\lambda = 0.99$
Matched:  $\lambda = 0.9$
Fast:     $\lambda = 0.5$


Figure 9.46 shows the matched adaptation of parameter coefficients while Figure 9.47 shows the resulting $\mathcal{D}(n)$ and $\mathcal{M}(n)$.
Clearly, the RLS tracks the varying coefficients nicely with expected small misadjustment and deviation errors. Figure 9.48 shows
the slow adaptation of parameter coefficients while Figure 9.49 shows the resulting $\mathcal{D}(n)$ and $\mathcal{M}(n)$. In this case, although the
RLS algorithm tracks with bounded error variance, the tracking is not very good and the resulting misadjustment errors are large.
Finally, Figure 9.50 shows the fast adaptation of parameter coefficients while Figure 9.51 shows the resulting $\mathcal{D}(n)$ and $\mathcal{M}(n)$.
In this case, although the algorithm is able to keep track of the slowly varying coefficients, the resulting variance is large and hence
the estimation errors are large. Once again, the total errors are large compared to those for the matched case.
FIGURE 9.46
Matched adaptation of slowly time-varying parameters: RLS algorithm with $\lambda = 0.9$.

FIGURE 9.47
Learning curves of RLS algorithm for matched adaptation.

FIGURE 9.48
Slow adaptation of slowly time-varying parameters: RLS algorithm with $\lambda = 0.99$. (Panels: tracking of $c_{o,1}(n)$ and tracking of $c_{o,2}(n)$.)

FIGURE 9.49
Learning curves of RLS algorithm for slow adaptation.

FIGURE 9.50
Fast adaptation of slowly time-varying parameters: RLS algorithm with $\lambda = 0.5$.

FIGURE 9.51
Learning curves of RLS algorithm for fast adaptation.


9.7.5. Comparison of Tracking Performance 


When the optimum filter drifts like a random walk with small increment variance $\sigma_\psi^2$, the tracking
performance for the LMS algorithm is given by (9.7.54) and (9.7.62) while that for the RLS algorithm is given by
(9.7.74) and (9.7.77). Whether the LMS or the RLS algorithm is better depends on the matrices $\mathbf{R}$ and $\mathbf{R}_\psi$. A general
comparison is difficult to make, but some guidelines have been developed for particular cases. It has been shown that
(Haykin 1996)


• When $\mathbf{R}_\psi = \sigma_\psi^2 \mathbf{I}$, both the LMS and RLS algorithms produce essentially the same minimum levels of MSD
and misadjustment. However, this analysis is true only asymptotically and for slowly varying parameters (small
$\sigma_\psi^2$).

• When $\mathbf{R}_\psi = \alpha \mathbf{R}$, where $\alpha$ is a constant, the LMS algorithm produces smaller values of the minimum
levels of MSD and misadjustment than the RLS algorithm does.

• When $\mathbf{R}_\psi = \beta \mathbf{R}^{-1}$, where $\beta$ is a constant, the RLS algorithm is better than the LMS algorithm in producing
the smaller values of the minimum levels of MSD and misadjustment.


In summary, the comparison of the acquisition and tracking performance of LMS and RLS adaptive filters is,
in practice, a very complicated subject. Although the previous analysis provides some insight, only
extensive simulations in the context of a specific application can help to choose the appropriate algorithm.


9.8 Summary 


In this chapter we discussed the theory of operation, design, performance evaluation, implementation, and 
applications of adaptive filters. The most significant attribute of an adaptive filter is its ability to incrementally adjust 
its coefficients so as to improve a predefined criterion of performance over time. 

We basically developed and analyzed two families of adaptive filtering algorithms: 


• The family of LMS FIR adaptive filters, which are based on a stochastic version of the steepest-descent
optimization algorithm.

• The family of RLS FIR adaptive filters, which are based on a stochastic version of the Newton-type optimization
algorithms.
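As a reminder of what the stochastic steepest-descent idea looks like in practice, here is a minimal Python sketch of the real-valued LMS update $\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu e(n)\mathbf{x}(n)$ for a direct-form FIR filter. The function name `lms` and the zero prewindowing of the data vector are assumptions made for illustration; this is not the MATLAB function referred to in the problems.

```python
import random


def lms(x, y, mu, M):
    """Real-valued LMS adaptive FIR filter of length M.

    Implements c(n) = c(n-1) + 2*mu*e(n)*x(n), where
    e(n) = y(n) - c(n-1)^T x(n) is the a priori error.
    Returns the final coefficients and the error sequence.
    """
    c = [0.0] * M
    errors = []
    for n in range(len(x)):
        # Data vector [x(n), x(n-1), ..., x(n-M+1)], zeros before n = 0.
        xv = [x[n - i] if n - i >= 0 else 0.0 for i in range(M)]
        e = y[n] - sum(c[i] * xv[i] for i in range(M))
        c = [c[i] + 2.0 * mu * e * xv[i] for i in range(M)]
        errors.append(e)
    return c, errors


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(3000)]
    # Hypothetical noiseless two-tap system to be identified.
    y = [0.9 * x[n] - 0.5 * (x[n - 1] if n else 0.0) for n in range(3000)]
    c, _ = lms(x, y, mu=0.01, M=2)
    print("estimated coefficients:", c)
```

The update costs on the order of M operations per sample, which is the main practical attraction of the LMS family compared with the RLS recursions.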

Both types of approaches can be used to develop adaptive algorithms for direct-form and lattice-ladder FIR filter
structures.


For LMS adaptive filters we focused on direct-form structures because those are the most widely used and
studied. However, we briefly discussed transform-domain and subband implementations because they offer a viable
solution for applications that require adaptive filters with very long impulse responses.

All RLS FIR adaptive filters discussed in this chapter exhibit identical performance if they are implemented
using infinite-precision arithmetic. The LMS algorithm (Section 9.4), the CRLS algorithm (Section 9.5), and the fast RLS
algorithms in Section 9.6 can be used only for FIR filtering and prediction applications. The steady-state performance
of LMS and RLS algorithms in a stationary environment is discussed in Sections 9.4 and 9.5, whereas their tracking
performance in a nonstationary environment is analyzed in Section 9.7.


The treatment of adaptive filters in this chapter has been quite extensive, in both number of topics and depth.
However, the following important topics have been omitted:

• IIR adaptive filters (Treichler et al. 1987; Johnson 1984; Shynk 1989; Regalia 1995; Netto et al. 1995; Williamson
1998). Although adaptive IIR filters have the potential to offer the same performance as FIR filters with less
computational complexity, they are not widely used in practical applications. The main reasons are related to the
nonquadratic nature of their performance error surface (see Section 5.2) and the additional stability problems
caused by the presence of poles in their system function.

• Adaptive filters using nonlinear filtering structures and neural networks (Grant and Mulgrew 1995; Haykin 1996;
Mathews 1991). The need for such filters arises in applications involving nonlinear input-output relationships,
nonlinear detectors (e.g., data equalization), and non-Gaussian or impulsive noise. The optimization required in
some of these cases can be performed using genetic optimization algorithms (Tang et al. 1996).

• FIR direct-form and lattice-ladder LS adaptive filters for multichannel signals (Slock 1993; Ling 1993b;
Carayannis et al. 1986).


Problems 


9.1 Consider the process x(n) generated using the AR(3) model
$x(n) = -0.729\,x(n-3) + w(n)$
where $w(n) \sim \mathrm{WGN}(0, 1)$. We want to design a linear predictor of x(n) using the SDA algorithm. Let
$\hat{y}(n) = \hat{x}(n) = c_{o,1}x(n-1) + c_{o,2}x(n-2) + c_{o,3}x(n-3)$
(a) Determine the $3 \times 3$ autocorrelation matrix $\mathbf{R}$ of x(n), and compute its eigenvalues $\{\lambda_i\}_{i=1}^{3}$.
(b) Determine the $3 \times 1$ cross-correlation vector $\mathbf{d}$.
(c) Choose the step size $\mu$ so that the resulting response is overdamped. Now implement the SDA
$\mathbf{c}_k = [c_{k,1}\;\; c_{k,2}\;\; c_{k,3}]^T = \mathbf{c}_{k-1} + 2\mu(\mathbf{d} - \mathbf{R}\mathbf{c}_{k-1})$
and plot the trajectories of $\{c_{k,i}\}_{i=1}^{3}$ as a function of k.
(d) Repeat part (c) by choosing $\mu$ so that the response is underdamped.


9.2 In the SDA algorithm, the index k is an iteration index and not a time index. However, we can treat it as a time index and use the
instantaneous filter coefficient vector $\mathbf{c}_k$ to filter data at n = k. This will result in an asymptotically optimum filter whose
coefficients will converge to the optimum ones. Consider the process x(n) given in Problem 9.1.
(a) Generate 500 samples of x(n) and implement the asymptotically optimum filter. Plot the signal $\hat{y}(n)$.
(b) Implement the optimum filter $\mathbf{c}_o$ on the same sequence, and plot the resulting $\hat{y}(n)$.
(c) Comment on the above two plots.


9.3 Consider the AR(2) process x(n) given in Example 9.3.1. We want to implement the Newton-type algorithm for faster
convergence using
$\mathbf{c}_k = \mathbf{c}_{k-1} - \mu\mathbf{R}^{-1}\nabla P(\mathbf{c}_{k-1})$
(a) Using $a_1 = -1.5955$ and $a_2 = 0.95$, implement the above method for $\mu = 0.1$ and $\mathbf{c}_0 = \mathbf{0}$. Plot the locus of $c_{k,1}$ versus
$c_{k,2}$.
(b) Repeat part (a), using $a_1 = -0.195$ and $a_2 = 0.95$.
(c) Repeat parts (a) and (b), using the optimum step size for $\mu$ that results in the fastest convergence.


9.4 Consider the adaptive linear prediction of an AR(2) process x(n) using the LMS algorithm in which
$x(n) = 0.95\,x(n-1) - 0.9\,x(n-2) + w(n)$
where $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$. The adaptive predictor is a second-order one given by $\mathbf{a}(n) = [a_1(n)\;\; a_2(n)]^T$.
(a) Implement the LMS algorithm given in Table 9.3 as a MATLAB function
[a,e]=lplms(x,y,mu,M,a0)
which computes filter coefficients in a and the corresponding error in e, given signal x, desired signal y, step size mu, filter
order M, and the initial coefficient vector a0.
(b) Generate 500 samples of x(n), and obtain linear predictor coefficients using the above function. Use step size $\mu$ so that the
algorithm converges in the mean. Plot predictor coefficients as a function of time along with the true coefficients.
(c) Repeat the above simulation 1000 times to obtain the learning curve, which is obtained by averaging the squared error $|e(n)|^2$.
Plot this curve and compare its steady-state value with the theoretical MSE.


9.5 Consider the adaptive echo canceler given in Figure 9.25. The FIR filter $c_o(n)$ is given by
$c_o(n) = (0.9)^n, \quad 0 \le n \le 2$
In this simulation, ignore the far-end signal u(n). The data signal x(n) is a zero-mean, unit-variance white Gaussian process,
and y(n) is its echo.
(a) Generate 1000 samples of x(n) and determine y(n). Use these signals to obtain a fourth-order LMS echo canceler in which
the step size $\mu$ is chosen to satisfy (9.4.40) and $\mathbf{c}(0) = \mathbf{0}$. Obtain the final echo canceler coefficients and compare them with
the true ones.
(b) Repeat the above simulation 500 times, and obtain the learning curve. Plot this curve along with the actual MSE and comment
on the plot.
(c) Repeat parts (a) and (b), using a third-order echo canceler.
(d) Repeat parts (a) and (b), using one-half the value of $\mu$ used in the first part.


9.6 The normalized LMS (NLMS) algorithm is given in (9.4.67), in which the effective step size is time-varying and is given by
$\tilde{\mu}/\|\mathbf{x}(n)\|^2$, where $0 < \tilde{\mu} < 1$.
(a) Modify the function firlms to implement the NLMS algorithm and obtain the function
[c,e]=nfirlms(x,y,mu,M,c0)
(b) Choose $\tilde{\mu} = 0.1$ and repeat Problem 9.4. Compare your results in terms of convergence speed.
(c) Choose $\tilde{\mu} = 0.1$ and repeat Problem 9.5(a) and (b). Compare your results in terms of convergence speed.
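The normalization idea can be sketched in a few lines of Python. This is an illustrative sketch, not the function nfirlms requested above: it assumes real-valued signals, keeps the $2\mu$ convention of the LMS update used in this chapter with $\mu(n) = \tilde{\mu}/\|\mathbf{x}(n)\|^2$, and adds a small constant `eps` (an assumption, not part of the problem statement) to avoid division by zero when the data vector is all zeros.

```python
import random


def nlms(x, y, mu_tilde, M, eps=1e-8):
    """Real-valued normalized LMS for an FIR filter of length M.

    The step size is mu(n) = mu_tilde / (eps + ||x(n)||^2), so the update is
    c(n) = c(n-1) + 2*mu(n)*e(n)*x(n) with 0 < mu_tilde < 1.
    Returns the final coefficients and the a priori error sequence.
    """
    c = [0.0] * M
    errors = []
    for n in range(len(x)):
        xv = [x[n - i] if n - i >= 0 else 0.0 for i in range(M)]
        e = y[n] - sum(c[i] * xv[i] for i in range(M))
        energy = sum(v * v for v in xv)          # ||x(n)||^2
        step = mu_tilde / (eps + energy)         # time-varying step size
        c = [c[i] + 2.0 * step * e * xv[i] for i in range(M)]
        errors.append(e)
    return c, errors


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    # Hypothetical noiseless two-tap system used only for illustration.
    y = [0.9 * x[n] - 0.5 * (x[n - 1] if n else 0.0) for n in range(2000)]
    c, _ = nlms(x, y, mu_tilde=0.1, M=2)
    print("estimated coefficients:", c)
```

Because the step is scaled by the instantaneous input energy, the convergence behavior becomes much less sensitive to the input power than it is for a fixed $\mu$.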





9.7 Another variation of the LMS algorithm is called the sign-error LMS algorithm, in which the coefficient update equation is given
by
$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu\,\mathrm{sgn}[e(n)]\,\mathbf{x}(n)$
where
$\mathrm{sgn}[e(n)] = \begin{cases} 1 & \mathrm{Re}[e(n)] > 0 \\ 0 & \mathrm{Re}[e(n)] = 0 \\ -1 & \mathrm{Re}[e(n)] < 0 \end{cases}$
The advantage of this algorithm is that the multiplication is replaced by a sign change, and if $\mu$ is chosen as a negative power of
2, then the multiplication is replaced by a shifting operation that is easy and fast to implement. Furthermore, since
$\mathrm{sgn}(x) = x/|x|$, the effective step size $\tilde{\mu}$ is inversely proportional to the magnitude of the error.
(a) Modify the function firlms to implement the sign-error LMS algorithm and obtain the function
[c,e]=sefirlms(x,y,mu,M,c0)
(b) Repeat Problem 9.4 and compare your results in terms of convergence speed.
(c) Repeat Problem 9.5(a) and (b) and compare your results in terms of convergence speed.
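To make the sign-error update concrete, the following Python sketch shows one possible real-valued form of it. It is only an illustration under stated assumptions, not the function sefirlms asked for above: the names `sgn` and `sign_error_lms` are hypothetical, and the data are assumed real so that Re[e(n)] is simply e(n).

```python
import random


def sgn(v):
    """Three-level sign function: 1, 0, or -1."""
    if v > 0:
        return 1
    if v < 0:
        return -1
    return 0


def sign_error_lms(x, y, mu, M):
    """Real-valued sign-error LMS: c(n) = c(n-1) + 2*mu*sgn[e(n)]*x(n).

    Returns the final coefficients and the a priori error sequence.
    """
    c = [0.0] * M
    errors = []
    for n in range(len(x)):
        xv = [x[n - i] if n - i >= 0 else 0.0 for i in range(M)]
        e = y[n] - sum(c[i] * xv[i] for i in range(M))
        s = sgn(e)
        # With mu a negative power of 2, 2*mu*s*x reduces to a shift and a sign change.
        c = [c[i] + 2.0 * mu * s * xv[i] for i in range(M)]
        errors.append(e)
    return c, errors


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    # Hypothetical noiseless two-tap system used only for illustration.
    y = [0.9 * x[n] - 0.5 * (x[n - 1] if n else 0.0) for n in range(5000)]
    c, _ = sign_error_lms(x, y, mu=2.0 ** -10, M=2)
    print("estimated coefficients:", c)
```

Since the correction has a fixed magnitude regardless of how small the error is, the coefficients keep jittering around the solution by an amount on the order of the step size, even without noise.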


9.8 Consider an AR(1) process $x(n) = a\,x(n-1) + w(n)$, where $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$. We wish to design a one-step first-order
linear predictor using the LMS algorithm
$\hat{x}(n) = \hat{a}(n-1)\,x(n-1)$
$e(n) = x(n) - \hat{x}(n)$
$\hat{a}(n) = \hat{a}(n-1) + 2\mu\,e(n)\,x(n-1)$
where $\mu$ is the adaptation step size.
(a) Determine the autocorrelation $r_x(l)$, the optimum first-order linear predictor, and the corresponding MMSE.
(b) Using the independence assumption, first determine and then solve the difference equation for $E\{\hat{a}(n)\}$.
(c) For $a = 0.95$, $\mu = 0.025$, $\sigma_w^2 = 1$, and $0 \le n \le N = 500$, determine the ensemble average of $E\{\hat{a}(n)\}$ using 200
independent runs and compare with the theoretical curve obtained in part (b).
(d) Using the independence assumption, first determine and then solve the difference equation for $P(n) = E\{e^2(n)\}$.
(e) Repeat part (c) for P(n) and comment upon the results.


9.9 Using the a posteriori error $\varepsilon(n) = y(n) - \mathbf{c}^H(n)\mathbf{x}(n)$, derive the coefficient updating formulas for the a posteriori error LMS
algorithm. Note: Refer to Equations (9.2.20) to (9.2.22).
9.10 Solve the interference cancelation problem described in Example 5.4.1, using the LMS algorithm, and compare its performance to 
that of the optimum canceler. 


9.11 Repeat the convergence analysis of the LMS algorithm for the complex case, using formula (9.4.27) instead of (9.4.28). 


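9.12 Consider the total transient excess MSE, defined by
$P_{\mathrm{tr}}^{(\mathrm{total})} = \sum_{n=0}^{\infty} P_{\mathrm{tr}}(n)$
in Section 9.4.3.
(a) Show that $P_{\mathrm{tr}}^{(\mathrm{total})}$ can be written as $P_{\mathrm{tr}}^{(\mathrm{total})} = \boldsymbol{\lambda}^T(\mathbf{I} - \mathbf{B})^{-1}\Delta\boldsymbol{\theta}(0)$, where $\Delta\boldsymbol{\theta}(0)$ is the initial (i.e., at n = 0) deviation of
the filter coefficients from their optimum setting.
(b) Starting with the formula in step (a), show that
$P_{\mathrm{tr}}^{(\mathrm{total})} = \dfrac{1}{4\mu}\,\dfrac{\sum_{i=1}^{M} \dfrac{\Delta\theta_i(0)}{1 - 2\mu\lambda_i}}{1 - \sum_{i=1}^{M} \dfrac{\mu\lambda_i}{1 - 2\mu\lambda_i}}$
(c) Show that if $\mu\lambda_k \ll 1$, then
$P_{\mathrm{tr}}^{(\mathrm{total})} \simeq \dfrac{1}{4\mu}\,\dfrac{\sum_{i=1}^{M}\Delta\theta_i(0)}{1 - \mu\,\mathrm{tr}(\mathbf{R})} \simeq \dfrac{1}{4\mu}\sum_{i=1}^{M}\Delta\theta_i(0)$
which is formula (9.4.62), discussed in Section 9.4.3.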


9.13 The frequency sampling structure for the implementation of an FIR filter $H(z) = \sum_{n=0}^{M-1} h(n)z^{-n}$ is specified by the following
relation
$H(z) = \dfrac{1 - z^{-M}}{M}\sum_{k=0}^{M-1}\dfrac{H(e^{j2\pi k/M})}{1 - e^{j2\pi k/M}z^{-1}} \triangleq H_1(z)H_2(z)$
where $H_1(z)$ is a comb filter with M zeros equally spaced on the unit circle and $H_2(z)$ is a filter bank of resonators. Note
that $\tilde{H}(k) \triangleq H(e^{j2\pi k/M})$, the DFT of $\{h(n)\}_{0}^{M-1}$, provides the coefficients of the filter. Derive an LMS-type algorithm to
update these coefficients, and sketch the resulting adaptive filter structure.


9.14 There are applications in which the use of a non-MSE criterion may be more appropriate. To this end, suppose that we wish to
design and study the behavior of an “LMS-like” algorithm that minimizes the cost function $P^{(k)} = E\{e^{2k}(n)\}$, $k = 1, 2, 3, \ldots$,
using the model defined in Figure 9.19.
(a) Use the instantaneous gradient vector to derive the coefficient updating formula for this LMS-like algorithm.
(b) Using the assumptions introduced in Section 9.4.2, show that
$E\{\tilde{\mathbf{c}}(n)\} = [\mathbf{I} - 2\mu k(2k-1)E\{e_o^{2k-2}(n)\}\mathbf{R}]\,E\{\tilde{\mathbf{c}}(n-1)\}$
where $\mathbf{R}$ is the input correlation matrix.
(c) Show that the derived algorithm converges in the mean if
$0 < 2\mu < \dfrac{1}{k(2k-1)E\{e_o^{2(k-1)}(n)\}\lambda_{\max}}$
where $\lambda_{\max}$ is the largest eigenvalue of $\mathbf{R}$.
(d) Show that for k = 1 the results in parts (a) to (c) reduce to those for the standard LMS algorithm.


9.15 Consider the noise cancelation system shown in Figure 9.6. The useful signal is a sinusoid $s(n) = \cos(\omega_0 n + \phi)$, where $\omega_0 = \pi/16$
and the phase $\phi$ is a random variable uniformly distributed from 0 to $2\pi$. The noise signals are given by
$v_1(n) = 0.9\,v_1(n-1) + w(n)$ and $v_2(n) = -0.75\,v_2(n-1) + w(n)$, where the sequences w(n) are WGN(0, 1).
(a) Design an optimum filter of length M and choose a reasonable value for $M_o$ by plotting the MMSE as a function of M.
(b) Design an LMS filter with $M_o$ coefficients and choose the step size $\mu$ to achieve a 10 percent misadjustment.
(c) Plot the signals $s(n)$, $s(n) + v_1(n)$, $v_2(n)$, the clean signal $e_o(n)$ using the optimum filter, and the clean signal $e_{\mathrm{lms}}(n)$
using the LMS filter, and comment upon the obtained results.


9.16 A modification of the LMS algorithm, known as the momentum LMS (MLMS), is defined by
$\mathbf{c}(n) = \mathbf{c}(n-1) + 2\mu e^*(n)\mathbf{x}(n) + \alpha[\mathbf{c}(n-1) - \mathbf{c}(n-2)]$
where $|\alpha| < 1$ (Roy and Shynk 1990).
(a) Rewrite the previous equation to show that the algorithm has the structure of a low-pass ($0 < \alpha < 1$) or a high-pass
($-1 < \alpha < 0$) filter.
(b) Explain intuitively the effect of the momentum term $\alpha[\mathbf{c}(n-1) - \mathbf{c}(n-2)]$ on the filter’s convergence behavior.
(c) Repeat the computer equalization experiment in Section 9.4.4, using both the LMS and the MLMS algorithms for the following
cases, and compare their performance:
i. $W = 3.1$, $\mu_{\mathrm{lms}} = \mu_{\mathrm{mlms}} = 0.01$, $\alpha = 0.5$.
ii. $W = 3.1$, $\mu_{\mathrm{lms}} = 0.04$, $\mu_{\mathrm{mlms}} = 0.01$, $\alpha = 0.5$.
iii. $W = 3.1$, $\mu_{\mathrm{lms}} = \mu_{\mathrm{mlms}} = 0.04$, $\alpha = 0.2$.
iv. $W = 4$, $\mu_{\mathrm{lms}} = \mu_{\mathrm{mlms}} = 0.03$, $\alpha = 0.3$.


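9.17 In Section 9.4.5 we presented the leaky LMS algorithm [see (9.4.88)]
$\mathbf{c}(n) = (1 - \alpha\mu)\mathbf{c}(n-1) + \mu e^*(n)\mathbf{x}(n)$
where $0 < \alpha < 1$ is the leakage coefficient.
(a) Show that the coefficient updating equation can be obtained by minimizing
$P(n) = |e(n)|^2 + \alpha\|\mathbf{c}(n)\|^2$
(b) Using the independence assumptions, show that
$E\{\mathbf{c}(n)\} = [\mathbf{I} - \mu(\mathbf{R} + \alpha\mathbf{I})]E\{\mathbf{c}(n-1)\} + \mu\mathbf{d}$
where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$ and $\mathbf{d} = E\{\mathbf{x}(n)y^*(n)\}$.
(c) Show that if $0 < \mu < 2/(\alpha + \lambda_{\max})$, where $\lambda_{\max}$ is the maximum eigenvalue of $\mathbf{R}$, then
$\lim_{n\to\infty} E\{\mathbf{c}(n)\} = (\mathbf{R} + \alpha\mathbf{I})^{-1}\mathbf{d}$
that is, in the steady state $E\{\mathbf{c}(\infty)\} \ne \mathbf{c}_o = \mathbf{R}^{-1}\mathbf{d}$.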
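The bias introduced by the leakage can be seen numerically in the simplest possible setting. The Python sketch below iterates the mean recursion of part (b) for a single real coefficient (M = 1), so that $\mathbf{R}$, $\mathbf{d}$, and $\lambda_{\max}$ all reduce to scalars. The function name `leaky_lms_mean` and the numerical values are assumptions chosen for illustration; it iterates the mean recursion only and does not simulate the stochastic algorithm.

```python
def leaky_lms_mean(r, d, alpha, mu, n_iter, c0=0.0):
    """Iterate E{c(n)} = [1 - mu*(r + alpha)] E{c(n-1)} + mu*d for a scalar filter.

    r     : input power (the 1x1 correlation matrix R, also its largest eigenvalue)
    d     : cross-correlation between input and desired response
    alpha : leakage coefficient
    mu    : step size, required to satisfy 0 < mu < 2 / (alpha + r)
    """
    if not 0.0 < mu < 2.0 / (alpha + r):
        raise ValueError("step size outside the convergence range")
    c = c0
    for _ in range(n_iter):
        c = (1.0 - mu * (r + alpha)) * c + mu * d
    return c


if __name__ == "__main__":
    r, d, alpha, mu = 1.0, 0.8, 0.1, 0.1   # hypothetical values
    c_inf = leaky_lms_mean(r, d, alpha, mu, n_iter=500)
    print("mean of leaky LMS coefficient:", c_inf)
    print("biased limit d / (r + alpha): ", d / (r + alpha))
    print("optimum coefficient d / r:    ", d / r)
```

The recursion settles at $d/(r + \alpha)$ rather than at the optimum $d/r$, which is the scalar version of the statement in part (c).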


9.18 There are various communications and speech signal processing applications that require the use of filters with linear phase
(Manolakis et al. 1984). For simplicity, assume that m is even.
(a) Derive the normal equations for an optimum FIR filter that satisfies the constraints
i. $\mathbf{c}_m^{(\mathrm{lp})} = \mathbf{J}\mathbf{c}_m^{(\mathrm{lp})*}$ (linear phase)
ii. $\mathbf{c}_m^{(\mathrm{cgd})} = -\mathbf{J}\mathbf{c}_m^{(\mathrm{cgd})*}$ (constant group delay).
(b) Show that the obtained optimum filters can be expressed as $\mathbf{c}_m^{(\mathrm{lp})} = \frac{1}{2}(\mathbf{c}_m + \mathbf{J}\mathbf{c}_m^*)$ and $\mathbf{c}_m^{(\mathrm{cgd})} = \frac{1}{2}(\mathbf{c}_m - \mathbf{J}\mathbf{c}_m^*)$, where $\mathbf{c}_m$ is
the unconstrained optimum filter.
(c) Using the results in part (b) and the algorithm of Levinson, derive a lattice-ladder structure for the constrained optimum filters.
(d) Repeat parts (a), (b), and (c) for the linear predictor with linear phase, which is specified by $\mathbf{a}_m^{(\mathrm{lp})} = \mathbf{J}\mathbf{a}_m^{(\mathrm{lp})*}$.
(e) Develop an LMS algorithm for the linear-phase filter $\mathbf{c}_m^{(\mathrm{lp})} = \mathbf{J}\mathbf{c}_m^{(\mathrm{lp})*}$ and sketch the resulting structure. Can you draw any
conclusions regarding the step size and the misadjustment of this filter compared to those of the unconstrained LMS
algorithm?


9.19 In this problem, we develop and analyze by simulation an LMS-type adaptive lattice predictor introduced in Griffiths (1977). We
consider the all-zero lattice filter defined in (6.5.7), which is completely specified by the lattice parameters $\{k_m\}_{0}^{M-1}$. The input
signal is assumed wide-sense stationary.
(a) Consider the cost function
$P_m^{fb} = E\{|e_m^f(n)|^2 + |e_m^b(n)|^2\}$
which provides the total prediction error power at the output of the mth stage, and show that
$\dfrac{\partial P_m^{fb}}{\partial k_{m-1}^*} = 2E\{e_m^{f*}(n)e_{m-1}^b(n-1) + e_{m-1}^{f*}(n)e_m^b(n)\}$
(b) Derive the updating formula
$k_{m-1}(n) = k_{m-1}(n-1) - 2\mu(n)[e_m^{f*}(n)e_{m-1}^b(n-1) + e_{m-1}^{f*}(n)e_m^b(n)]$
where the normalized step size $\mu(n) = \tilde{\mu}/E_{m-1}(n)$ is computed in practice by using the formula
$E_{m-1}(n) = \alpha E_{m-1}(n-1) + (1-\alpha)[|e_{m-1}^f(n)|^2 + |e_{m-1}^b(n-1)|^2]$
where $0 < \alpha < 1$. Explain the role and proper choice of $\alpha$, and determine the proper initialization of the algorithm.
(c) Write a MATLAB function to implement the derived algorithm, and compare its performance with that of the LMS algorithm in
the linear prediction problem discussed in Example 9.4.1.


9.20 Consider a signal x(n) consisting of a harmonic process plus white noise, that is,
$x(n) = A\cos(\omega_0 n + \phi) + w(n)$
where $\phi$ is uniformly distributed from 0 to $2\pi$ and $w(n) \sim \mathrm{WGN}(0, \sigma_w^2)$.
(a) Determine the output power $\sigma_y^2 = E\{y^2(n)\}$ of the causal and stable filter
$y(n) = \sum_{k=0}^{\infty} h(k)x(n-k)$
and show that we can cancel the harmonic process using the ideal notch filter
$H(e^{j\omega}) = \begin{cases} 1 & \omega \ne \omega_0 \\ 0 & \text{otherwise} \end{cases}$
Is the obtained ideal notch filter practically realizable? That is, is the system function rational? Why?
(b) Consider the second-order notch filter
$H(z) = \dfrac{D(z)}{A(z)} = \dfrac{1 + az^{-1} + z^{-2}}{1 + a\rho z^{-1} + \rho^2 z^{-2}} = \dfrac{D(z)}{D(z/\rho)}$
where $-1 < \rho < 1$ determines the steepness of the notch and $a = -2\cos\omega_0$ its frequency. We fix $\rho$, and we wish to
design an adaptive filter by adjusting a.
i. Show that for $\rho \simeq 1$, $\sigma_y^2 = A^2|H(e^{j\omega_0})|^2 + \sigma_w^2$, and plot $\sigma_y^2$ as a function of the frequency $\omega$ for $\omega_0 = \pi/6$.
ii. Evaluate $d\sigma_y^2(a)/da$ and show that the minimum of $\sigma_y^2(a)$ occurs for $a = -2\cos\omega_0$.
(c) Using a direct-form II structure for the implementation of H(z) and the property $dY(z)/da = [dH(z)/da]X(z)$, show
that the following relations
$s_2(n) = -a(n-1)\rho s_2(n-1) - \rho^2 s_2(n-2) + (1-\rho)s_1(n-1)$
$g(n) = s_2(n) - \rho s_2(n-2)$
$s_1(n) = -a(n-1)\rho s_1(n-1) - \rho^2 s_1(n-2) + x(n)$
$y(n) = s_1(n) + a(n-1)s_1(n-1) + s_1(n-2)$
$a(n) = a(n-1) - 2\mu y(n)g(n)$
constitute an adaptive LMS notch filter. Draw its block diagram realization.
(d) Simulate the operation of the obtained adaptive filter for $\rho = 0.9$, $\omega_0 = 2\pi/6$, and SNR 5 and 15 dB. Plot
$\hat{\omega}_0(n) = \arccos[-a(n)/2]$ as a function of n, and investigate the tradeoff between convergence rate and misadjustment by
experimenting with various values of $\mu$.


9.21 Consider the AR(2) process given in Problem 9.4. We will design the adaptive linear predictor using the RLS algorithm. The
adaptive predictor is a second-order one given by $\mathbf{c}(n) = [c_1(n)\;\; c_2(n)]^T$.
(a) Develop a MATLAB function to implement the RLS algorithm given in Table 9.6
[c,e]=rls(x,y,lambda,delta,M,c0)
which computes filter coefficients in c and the corresponding error in e given signal x, desired signal y, forgetting factor
lambda, initialization parameter delta, filter order M, and the initial coefficient vector c0. To update $\mathbf{P}(n)$, compute only
the upper or lower triangular part and determine the other part by using Hermitian symmetry.
(b) Generate 500 samples of x(n) and obtain linear predictor coefficients using the above function. Use a very small value for $\delta$
(for example, 0.001) and various values of $\lambda$ = 0.99, 0.95, 0.9, and 0.8. Plot predictor coefficients as a function of time along
with the true coefficients for each $\lambda$, and discuss your observations. Also compare your results with those in Problem 9.4.
(c) Repeat each simulation above 1000 times to get corresponding learning curves, which are obtained by averaging respective
squared errors $|e(n)|^2$. Plot these curves and compare their steady-state value with the theoretical MSE.


9.22 Consider a system identification problem where we observe the input x(n) and the noisy output $y(n) = y_o(n) + v(n)$, for
$0 \le n \le N-1$. The unknown system is specified by the system function
$H_o(z) = \dfrac{0.0675 + 0.1349z^{-1} + 0.0675z^{-2}}{1 - 1.1430z^{-1} + 0.4128z^{-2}}$
and $x(n) \sim \mathrm{WGN}(0, 1)$, $v(n) \sim \mathrm{WGN}(0, 0.01)$, and N = 300.
(a) Model the unknown system using an LS FIR filter, with M = 15 coefficients, using the no-windowing method. Compute
the total LSE $E_{\mathrm{ls}}$ in the interval $n_0 \le n \le N-1$ for $n_0 = 20$.
(b) Repeat part (a) for $0 \le n \le n_0 - 1$ (do not compute $E_{\mathrm{ls}}$). Use the vector $\mathbf{c}(n_0)$ and the matrix $\mathbf{P}(n_0) = \hat{\mathbf{R}}^{-1}(n_0)$ to
initialize the CRLS algorithm. Compute the total errors $E_{\mathrm{apr}} = \sum_{n=n_0}^{N-1} e^2(n)$ and $E_{\mathrm{apost}} = \sum_{n=n_0}^{N-1} \varepsilon^2(n)$ by running the CRLS
for $n_0 \le n \le N-1$.
(c) Order the quantities $E_{\mathrm{ls}}$, $E_{\mathrm{apr}}$, $E_{\mathrm{apost}}$ by size and justify the resulting ordering.

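9.23 Prove Equation (9.5.25) using the identity $\det(\mathbf{I}_1 + \mathbf{A}\mathbf{B}) = \det(\mathbf{I}_2 + \mathbf{B}\mathbf{A})$, where identity matrices $\mathbf{I}_1$ and $\mathbf{I}_2$ and matrices $\mathbf{A}$ and
$\mathbf{B}$ have compatible dimensions. Hint: Put (9.5.7) in the form $\mathbf{I}_1 + \mathbf{A}\mathbf{B}$.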


9.24 Derive the normal equations that correspond to the minimization of the cost function (9.5.36), and show that for $\delta = 0$ they are
reduced to the standard set (9.5.2) of normal equations. For the situation described in Problem 9.22, run the CRLS algorithm for
various values of $\delta$ and determine the range of values that provides acceptable performance.


9.25 Modify the CRLS algorithm in Table 9.6 so that its coefficients satisfy the linear-phase constraint $\mathbf{c} = \mathbf{J}\mathbf{c}^*$. For simplicity, assume
that M = 2L; that is, the filter has an even number of coefficients.


9.26 Following the approach used in Section 6.5.1 to develop the structure shown in Figure 6.1, derive a similar structure based on the
Cholesky (not the $\mathbf{L}\mathbf{D}\mathbf{L}^H$) decomposition.


9.27 Show that the partitioning (9.6.3) of $\hat{\mathbf{R}}_{m+1}(n)$ to obtain the same partitioning structure as (9.6.2) is possible only if we apply the
prewindowing condition $\mathbf{x}_m(-1) = \mathbf{0}$. What is the form of the partitioning if we abandon the prewindowing assumption?


9.28 Derive the normal equations and the LSE formulas given in Table 9.8 for the FLP and the BLP methods.


9.29 Derive the FLP and BLP a priori and a posteriori updating formulas given in Table 9.9.


9.30 Modify Table 9.11 for the FAEST algorithm, to obtain a table for the FTF algorithm, and write a MATLAB function for its
implementation. Test the obtained function, using the equalization experiment in Example 9.5.2.


9.31 If we wish to initialize the fast RLS algorithms (fast Kalman, FAEST, and FTF) using an exact method, we need to collect a set of
data $\{\mathbf{x}(n), y(n)\}_{0}^{n_0}$ for any $n_0 > M$.
(a) Identify the quantities needed to start the FAEST algorithm at $n = n_0$. Form the normal equations and use the $\mathbf{L}\mathbf{D}\mathbf{L}^H$
decomposition method to determine these quantities.
(b) Write a MATLAB function faestexact.m that implements the FAEST algorithm using the exact initialization procedure
described in part (a).
(c) Use the functions faest.m and faestexact.m to compare the two different initialization approaches for the FAEST
algorithm in the context of the equalization experiment in Example 9.5.2. Use $n_0 = 1.5M$ and $n_0 = 3M$. Which value of
$\delta$ gives results closest to the exact initialization method?


9.32 Using the order-recursive approach introduced in Section 6.3.1, develop an order-recursive algorithm for the solution of the normal
equations (9.5.2), and check its validity by using it to initialize the FAEST algorithm, as in Problem 9.31. Note: In Section 6.3.1
we could not develop a closed-form algorithm because some recursions required the quantities $\mathbf{b}_m(n-1)$ and $E_m^b(n-1)$. Here
we can avoid this problem by using time recursions.


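9.33 In this problem we discuss several quantities that can serve to warn of ill behavior in fast RLS algorithms for FIR filters.
(a) Show that the variable
$\eta_m(n) \triangleq \dfrac{\alpha_{m+1}(n)}{\alpha_m(n)} = \dfrac{\lambda E_m^b(n-1)}{E_m^b(n)} = 1 - g_{m+1}^{(m+1)}(n)e_m^{b*}(n)$
satisfies the condition $0 \le \eta_m(n) \le 1$.
(b) Prove the relations
$\alpha_m(n) = \lambda^m\dfrac{\det\hat{\mathbf{R}}_m(n-1)}{\det\hat{\mathbf{R}}_m(n)} \qquad E_m^f(n) = \dfrac{\det\hat{\mathbf{R}}_{m+1}(n)}{\det\hat{\mathbf{R}}_m(n-1)} \qquad E_m^b(n) = \dfrac{\det\hat{\mathbf{R}}_{m+1}(n)}{\det\hat{\mathbf{R}}_m(n)}$
(c) Show that
$\alpha_m(n) = \lambda^m\dfrac{E_m^b(n)}{E_m^f(n)}$
and use it to explain why the quantity $\eta_m^{\alpha}(n) = E_m^f(n) - \lambda^m E_m^b(n)$ can be used as a warning variable.
(d) Explain how the quantities
$\eta_{\bar{g}}(n) \triangleq \bar{g}_{M+1}^{(M+1)}(n) - \dfrac{e^b(n)}{\lambda E^b(n-1)}$
$\eta_b(n) \triangleq e^b(n) - \lambda E^b(n-1)\,\bar{g}_{M+1}^{(M+1)}(n)$
can be used as warning variables.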


9.34 When the desired response is $y(j) = \delta(j-k)$, that is, a spike at $j = k$, $0 \le k \le n$, the LS filter $\mathbf{c}_m^{(k)}$ is known as a spiking filter
or as an LS inverse filter (see Section 7.3).
(a) Determine the normal equations and the LSE $E_m^{(k)}(n)$ for the LS filter $\mathbf{c}_m^{(k)}$.
(b) Show that $\mathbf{c}_m^{(n)} = \mathbf{g}_m(n)$ and $E_m^{(n)}(n) = \alpha_m(n)$ and explain their meanings.
(c) Use the interpretation $\alpha_m(n) = E_m^{(n)}(n)$ to show that $0 \le \alpha_m(n) \le 1$.
(d) Show that g (n)= cV —1)x(k) and explain its meaning. 
k=0 


9.35 Derive Equations (9.6.33) through (9.6.35) for the a posteriori LS lattice-ladder structure, shown in Figure 9.37, starting with the
partitionings (9.6.1) and the matrix inversion by partitioning relations (9.6.7) and (9.6.8).


9.36 Prove relations (9.6.45) and (9.6.46) for the updating of the ladder partial correlation coefficient $\beta_m^c(n)$.


9.37 In Section 6.3.1 we derived order-recursive relations for the FLP, BLP, and FIR filtering MMSEs.
(a) Following the derivation of (6.3.36) and (6.3.37), derive similar order-recursive relations for $E_m^f(n)$ and $E_m^b(n)$.
(b) Show that we can obtain a complete LS lattice-ladder algorithm by replacing, in Table 9.12, the time-recursive updatings of
$E_m^f(n)$ and $E_m^b(n)$ with the obtained order-recursive relations.
(c) Write a MATLAB function for this algorithm, and verify it by using the equalization experiment in Example 9.5.2.


9.38 Derive the equations for the a priori RLS lattice-ladder algorithm given in Table 9.13, and write a MATLAB function for its
implementation. Test the function by using the equalization experiment in Example 9.5.2.


9.39 Derive the equations for the a priori RLS lattice-ladder algorithm with error feedback (see Table 9.7), and write a MATLAB function
for its implementation. Test the function by using the equalization experiment in Example 9.5.2.


9.40 Derive the equations for the a posteriori RLS lattice-ladder algorithm with error feedback (Ling et al. 1986) and write a MATLAB
function for its implementation. Test the function by using the equalization experiment in Example 9.5.2.


9.41 The a posteriori and the a priori RLS lattice-ladder algorithms need the conversion factor $\alpha_m(n)$ because the updating of the
quantities $E_m^f(n)$, $E_m^b(n)$, $\beta_m(n)$, and $\beta_m^c(n)$ requires both the a priori and a posteriori errors. Derive a double (a priori and a
posteriori) lattice-ladder RLS filter that avoids the use of the conversion factor by updating both the a priori and the a posteriori
prediction and filtering errors.


9.42 The implementation of adaptive filters using multiprocessing involves the following steps: (1) partitioning of the overall
computational job into individual tasks, (2) allocation of computational and communications tasks to the processors, and (3)
synchronization and control of the processors. Figure 9.52 shows a cascade multiprocessing architecture used for adaptive filtering.
To avoid latency (i.e., a delay between the filter’s input and output that is larger than the sampling interval), each processor should
complete its task in time less than the sampling period and use results computed by the preceding processor and the scalar
computational unit at the previous sampling interval. This is accomplished by the unit delays inserted between the processors.

(a) Explain why the fast Kalman algorithm, given in Table 9.10, does not satisfy the multiprocessing requirements.

(b) Prove the formulas
$\mathbf{b}(n) = \dfrac{\mathbf{b}(n-1) - \mathbf{g}_{M+1}^{\lceil M\rceil}(n)e^{b*}(n)}{1 - g_{M+1}^{(M+1)}(n)e^{b*}(n)}$  (k)
$\mathbf{g}(n) = \mathbf{g}_{M+1}^{\lceil M\rceil}(n) - g_{M+1}^{(M+1)}(n)\mathbf{b}(n)$  (l)
and show that they can be used to replace formulas (g) and (h) in Table 9.10.

(c) Rearrange the formulas in Table 9.10 as follows: (e), (k) (D, (a) (b), (c), A), (d), Replace n by n—-1 in (e), (1), 
and (k). Show that the resulting algorithm complies with the multiprocessing architecture shown in Figure 9.52. 

(d) Draw a block diagram of a single multiprocessing section that can be used in the multiprocessing architecture shown in Figure
9.52. Each processor in Figure 9.52 can be assigned to execute one or more of the designed sections. Note: You may find
useful the discussions in Lawrence and Tewksbury (1983) and in Manolakis and Patel (1992).

(e) Figure 9.53 shows an alternative implementation of a multiprocessing section that can be used in the architecture of Figure 9.52. 
Identify the input-output quantities and the various multiplier factors. 








FIGURE 9.52
Cascade multiprocessing architecture for the implementation of FIR adaptive filters.

FIGURE 9.53
Section for the multiprocessing implementation of the fast Kalman algorithm.


9.43 Repeat Problem 9.42 for the FAEST algorithm shown in Table 9.11.


9.44 Show that the a priori RLS linear prediction lattice (i.e., without the ladder part) algorithm with error feedback complies with the 
multiprocessing architecture of Figure 9.52. Explain why the addition of the ladder part violates the multiprocessing architecture. 
Can we rectify these violations? (See Lawrence and Tewksbury 1983.) 


9.45 The fixed-length sliding window RLS algorithm is given in (9.7.4) through (9.7.10).
(a) Derive the above equations of this algorithm (see Manolakis et al. 1987).
(b) Develop a MATLAB function to implement the algorithm
[c,e]=slwrls(x,y,L,delta,M,c0);
where L is the fixed length of the window.
(c) Generate 500 samples of the following nonstationary process
$x(n) = \begin{cases} w(n) + 0.95x(n-1) - 0.9x(n-2) & 0 \le n < 200 \\ w(n) - 0.95x(n-1) - 0.9x(n-2) & 200 \le n < 300 \\ w(n) + 0.95x(n-1) - 0.9x(n-2) & n \ge 300 \end{cases}$
where w(n) is a zero-mean, unit-variance white noise process. We want to obtain a second-order linear predictor using
adaptive algorithms. Use the sliding window RLS algorithm on the data and choose L = 50 and 100. Obtain plots of the
filter coefficients and mean square error.
(d) Now use the growing memory RLS algorithm by choosing $\lambda = 1$. Compare your results with the sliding-window RLS
algorithm.
(e) Finally, use the exponentially growing memory RLS by choosing $\lambda = (L-1)/(L+1)$ that produces the same MSE.
Compare your results.


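9.46 Consider the definition of the MSD $\mathcal{D}(n)$ in (9.2.29) and that of the trace of a matrix.
(a) Show that $\mathcal{D}(n) = \mathrm{tr}\{\boldsymbol{\Phi}(n)\}$, where $\boldsymbol{\Phi}(n)$ is the correlation matrix of $\tilde{\mathbf{c}}(n)$.
(b) For the evolution of the correlation matrix in (9.7.58), show that
$\mathcal{D}(\infty) = \mu M\sigma_v^2 + \dfrac{\mathrm{tr}(\mathbf{R}^{-1}\mathbf{R}_\psi)}{4\mu}$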
9.47 Consider the analysis model given in Figure 9.39. Let the parameters of this model be as follows:
$\mathbf{c}_o(n)$ model parameters: $\mathbf{c}_o(0) = [0.9 \;\; \ldots]^T$, $M = 2$, $\rho = 0.95$
$\boldsymbol{\psi}(n) \sim \mathrm{WGN}(\mathbf{0}, \mathbf{R}_\psi)$, $\mathbf{R}_\psi = (0.01)^2\mathbf{I}$
Signal $\mathbf{x}(n)$ parameters: $\mathbf{x}(n) \sim \mathrm{WGN}(\mathbf{0}, \mathbf{R})$, $\mathbf{R} = \mathbf{I}$
Noise v(n) parameters: $v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$, $\sigma_v = 0.1$
Simulate the system, using three values of $\mu$ that show slow, matched, and optimum adaptations of the LMS algorithm.
(a) Obtain the tracking plots similar to Figure 9.40 for each of the above three adaptations.
(b) Obtain the learning curve plots similar to Figure 9.41 for each of the above three adaptations.


9.48 Consider the analysis model given in Figure 9.39. Let the parameters of this model be as follows:
$\mathbf{c}_o(n)$ model parameters: $\mathbf{c}_o(0) = [0.9 \;\; \ldots]^T$, $M = 2$, $\rho = 0.95$
$\boldsymbol{\psi}(n) \sim \mathrm{WGN}(\mathbf{0}, \mathbf{R}_\psi)$, $\mathbf{R}_\psi = (0.01)^2\mathbf{I}$
Signal $\mathbf{x}(n)$ parameters: $\mathbf{x}(n) \sim \mathrm{WGN}(\mathbf{0}, \mathbf{R})$, $\mathbf{R} = \mathbf{I}$
Noise v(n) parameters: $v(n) \sim \mathrm{WGN}(0, \sigma_v^2)$, $\sigma_v = 0.1$
Simulate the system, using three values of $\lambda$ that show slow, matched, and optimum adaptations of the RLS algorithm.
(a) Obtain the tracking plots similar to Figure 9.46 for each of the above three adaptations.
(b) Obtain the learning curve plots similar to Figure 9.47 for each of the above three adaptations.
(c) Compare your results with those obtained in Problem 9.47.


9.49 Consider the time-varying adaptive equalizer shown in Figure 9.54 in which the time variation of the channel impulse response is
given by
$h(n) = \rho h(n-1) + \sqrt{1-\rho}\,\eta(n)$
with $\rho = 0.95$, $\eta(n) \sim \mathrm{WGN}(0, \sqrt{10})$, $h(0) = 0.5$.
Let the equalizer be a single-tap equalizer and $v(n) \sim \mathrm{WGN}(0, 0.1)$.
(a) Simulate the system for three different adaptations; that is, choose $\mu$ for slow, matched, and fast adaptations of the LMS
algorithm.
(b) Repeat part (a), using the RLS algorithm.


FIGURE 9.54
Adaptive channel equalizer system with time-varying channel in Problem 9.49.


